Feature Engineering: Feature Selection
The initial and fundamental stage of developing a GeoAI model
Hello everyone,
A paradigm shift has occurred in geoscience branches such as geographic information science and remote sensing. The focus has moved from algorithmic approaches built mainly on deterministic models for geospatial analysis toward heuristic, data-driven approaches supported by, for example, Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL). This is my own conclusion, drawn from lectures and discussions with professors on campus.
This transition gives us the flexibility to identify patterns in data without following a rigid set of rules. In this context, AI can induce insights from data, not unlike human intuition in decision-making. The results are usually not 100% perfect, but there is a justification that explains those imperfections, and that is exactly where the beauty lies.
I might start the content on this blog with the theme of Geospatial Artificial Intelligence (GeoAI). I don’t think it will be a series like #BelajarGEE, although GEE is easy to adapt and can be integrated with GeoAI through Python, as shown in content from my friend Yakub Hariana. As is commonly done, data processing should begin with pre-processing, and one important step in it is Feature Engineering.
In today’s era of big data, one aspect of feature engineering, namely feature selection, can greatly assist our work. When preparing high-dimensional data for GeoAI, whether for Machine Learning (ML) or Deep Learning (DL), feature selection helps reduce noise and eliminate redundant or irrelevant features, which decreases computational load and improves model accuracy. The assumption is that by processing only the most relevant features, the AI model interprets and predicts better and is less prone to overfitting [1] [2].
To give a brief example, suppose we want to perform multispectral classification using an image with 12 bands. Not all features (in this case, the bands of the image) are necessarily significant for the predictions made by the AI model, so the irrelevant ones are removed to improve the efficiency and accuracy of the model. This echoes what the American mathematician Richard Ernest Bellman called the Curse of Dimensionality back in 1961, and we are feeling its effects now because of the exponential growth of data. I believe the compromise lies in managing data more efficiently.
Preparing Training Samples
Download the Required Materials
The materials I use for performing feature selection are twofold: raster imagery and training samples for labeling the raster data. To make it easier, I’m using data from my previous content. Download the materials below:
- Sentinel-2 data
- Training sample for labeling the data
Convert Training Samples to Raster
This step can be done using QGIS. Here’s how:
- Convert the projection system of the vector-format training sample (the one I used has a .geojson extension) to UTM.
- Rasterize it using the settings shown below.
- Reproject the raster back to the geographic coordinate system.
The training samples represent land cover classes, each labeled with a number from 1 to 7 corresponding, in order, to: water, built-up land, agricultural vegetation, vegetation, mangrove, low vegetation, and clouds. When converted to raster format, these numbers become the raster cell values.
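If you prefer to stay in Python (using the environment set up in the next section) instead of QGIS, the rasterization can be sketched with GeoPandas and rasterio. This is only a minimal sketch: it assumes the GeoJSON has a numeric 'class' attribute holding the labels 1 to 7, that the Sentinel-2 image defines the target grid, and the file names training_sample.geojson and training_sample.tif are placeholders.
import geopandas as gpd
import rasterio
from rasterio.features import rasterize
# Open the Sentinel-2 image to borrow its grid (transform, shape, CRS)
with rasterio.open(r"....Your Path\Data\s2_median.tif") as src:
    meta = src.meta.copy()
    transform, shape, crs = src.transform, (src.height, src.width), src.crs
# Read the training polygons and reproject them to the raster CRS
roi = gpd.read_file(r"....Your Path\Data\training_sample.geojson").to_crs(crs)
# Burn each polygon's class value (assumed 'class' attribute) into a raster aligned with the image
labels = rasterize(
    [(geom, value) for geom, value in zip(roi.geometry, roi["class"])],
    out_shape=shape,
    transform=transform,
    fill=0,  # background, flagged as nodata below
    dtype="uint8",
)
# Save the label raster with 0 as nodata
meta.update(count=1, dtype="uint8", nodata=0)
with rasterio.open(r"....Your Path\Data\training_sample.tif", "w", **meta) as dst:
    dst.write(labels, 1)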
Preparing the Anaconda environment
We will use Python to process the data from here. There are no restrictions on the Integrated Development Environment (IDE) that can be used; it can be cloud-based like Google Colab or Kaggle or desktop-based like Jupyter Notebook, PyCharm, and Visual Studio Code (VSCode) with Jupyter and Python extensions. However, so far, I feel most comfortable using VSCode.
I use Anaconda because it comes with many pre-installed Python libraries and packages. After installing Anaconda, the next step is to create a virtual environment to avoid conflicts between different versions of libraries and Python packages. To create and activate it, open the Anaconda Prompt (on Windows, search for ‘Anaconda Prompt’ in the Start menu), then enter the following commands to create and activate a virtual environment named ‘geo’.
conda create -n geo python=3.12
conda activate geo
In VSCode, press Ctrl+Shift+P, choose ‘Python: Select Interpreter’, and select the previously created virtual environment.
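While the environment is active, you can also install the libraries used in the rest of this post (plus GeoPandas, if you tried the rasterization sketch above). The conda-forge package names below are my assumption for them; Boruta itself is installed later, in the Boruta section.
conda install -c conda-forge rasterio geopandas numpy pandas scikit-learn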
Create a Notebook File and Organize the Folder Structure
Place the training sample and image files in a ‘Data’ folder inside the main project folder. Then create a Python notebook file with the .ipynb extension in the project folder: right-click the project folder in VSCode and select ‘New File’.
Convert training sample into CSV
We need to convert the training samples into tabular format, since most Python libraries work with tables. Tabular data with column headers is referred to as a dataframe, analogous to the attribute table of vector data, while raster data is represented as a matrix or array (2D or multidimensional). Why is this the case? Because Python libraries, including Machine Learning (ML) libraries such as Scikit-learn, TensorFlow, and PyTorch, are designed to work with tabular and array data [3] [4]. The most common libraries here are NumPy for arrays, Pandas for dataframes, and GeoPandas for geospatial dataframes.
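To make that array-to-table relationship concrete, here is a toy sketch with made-up values showing how a (bands, rows, cols) array flattens into a table with one row per pixel; it is the same reshape that the function below applies to the real data.
import numpy as np
import pandas as pd
# A toy 3-band, 2x2 "raster" (hypothetical values)
toy_raster = np.arange(12).reshape(3, 2, 2)  # (bands, rows, cols)
# Flatten it so every pixel becomes a row and every band a column
toy_table = pd.DataFrame(toy_raster.reshape(3, -1).T, columns=["B2", "B3", "B4"])
print(toy_table)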
Python libraries used:
- rasterio: reads raster geospatial data and handles resampling (changing spatial resolution) and reprojection.
- numpy: creates and manipulates arrays.
- pandas: manipulates dataframes.
import rasterio
from rasterio.enums import Resampling
from rasterio.warp import reproject, calculate_default_transform
from rasterio.mask import mask
import numpy as np
import pandas as pd
def training_sample(raster_path, roi_path, output_csv_path):
    # Load raster data
    raster = rasterio.open(raster_path)
    # Load training sample
    roi = rasterio.open(roi_path)
    # Create an array for the resampled training sample, aligned with the raster grid
    roi_resampled = np.empty(shape=(raster.height, raster.width), dtype=roi.dtypes[0])
    # Reproject the roi to match the raster data
    reproject(
        source=rasterio.band(roi, 1),
        destination=roi_resampled,
        src_transform=roi.transform,
        src_crs=roi.crs,
        dst_transform=raster.transform,
        dst_crs=raster.crs,
        resampling=Resampling.nearest
    )
    # Create a mask from the resampled roi
    roi_mask = roi_resampled != roi.nodata
    # Apply the mask to the raster data
    raster_data = raster.read()
    raster_masked = np.where(roi_mask, raster_data, np.nan)
    # Stack the class labels on top of the masked bands
    stacked_data = np.vstack([roi_resampled[np.newaxis, ...], raster_masked])
    # Convert to DataFrame and set the class and band names
    # (the band names depend on the raster you use; these are for Sentinel-2)
    df = pd.DataFrame(
        stacked_data.reshape(stacked_data.shape[0], -1).T,
        columns=["class", "B2", "B3", "B4", "B5", "B6", "B7", "B8", "B8A", "B9", "B11", "B12"])
    # Filter out unwanted class values and ensure the 'class' column is integer
    valid_classes = [1, 2, 3, 4, 5, 6, 7]
    df = df[df['class'].isin(valid_classes)]
    df['class'] = df['class'].astype(int)
    # Remove rows with missing values
    df_clean = df.dropna()
    # Save to CSV
    df_clean.to_csv(output_csv_path, index=False)
    print(f"CSV file created successfully! {df_clean}")
# Run the function
training_sample(
    raster_path=r"....Your Path\Data\s2_median.tif",
    roi_path=r"....Your Path\Data\Training sample Lampung_gcs.tif",
    output_csv_path=r"....Your Path\Data\1. training_data_new2.csv"
)
The output CSV will have a column named class, representing the land cover class, followed by columns named after the Sentinel-2 bands.
Feature selection process
There are several feature selection methods; this time I’ll try two of the most common: Boruta and Recursive Feature Elimination (RFE). Both are considered efficient, flexible to integrate with various algorithms, and helpful in reducing overfitting. Of course, there are trade-offs: Boruta tends to keep every feature that is likely relevant but is quite resource-intensive, while RFE is lighter but may discard features that are actually important [5] [6].
Import the Boruta library
Let’s start with Boruta. If it is not installed on your computer, open the Anaconda Prompt and run the command conda install boruta to download it. In this step, we import Boruta and RandomForestClassifier from Scikit-learn.
import sklearn
import boruta
from sklearn.ensemble import RandomForestClassifier
Load the training sample (.csv) that was created earlier.
sample_path = r"....Your Path\Data\1. training_data_new2.csv"
df = pd.read_csv(sample_path)
print(f"Training Data Prop {df.head()}", f"Training Data{df.info}")
A quick tip for getting a file’s full path in VSCode: right-click the file in question and select ‘Copy Path’.
Split data into training and test sets
As is common in machine learning, we need to split the sample data into two sets: one for training and one for testing. Here’s a brief explanation:
- x = df.iloc[:, 1:]: selects all columns of the dataframe (training sample) except the first column, which contains the class labels; these band values are the predictors.
- y = df.iloc[:, 0]: the first column of the dataframe is used as the target.
from sklearn.model_selection import train_test_split
x = df.iloc[:, 1:]
y = df.iloc[:, 0]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
print(f"X train {x_train}")
print(f"X test {x_test}")
print(f"Y test {y_test}")
print(f"Y train {y_train}")
Feature Selection with Boruta
This step involves configuring feature selection with Boruta. I want to be a bit more aggressive by adjusting some parameters, such as:
- max_depth = 10: a larger tree depth, on the assumption that it can capture more complex patterns.
- alpha = 0.01: tightens the selection criteria, so only the most significant features are kept; the risk is that some relevant features may be overlooked because of the stricter threshold.
- max_iter = 50: the maximum number of iterations in the feature selection process.
This aggressive configuration is not always necessary, considering its consequences, such as increased computational load and the risk of overfitting.
from boruta import BorutaPy
# Restore deprecated NumPy aliases that older BorutaPy releases still use (needed on newer NumPy versions)
np.int = np.int32
np.float = np.float64
np.bool = np.bool_
# max_depth: A deeper tree might capture more complex patterns, but it could also lead to overfitting.
rf = RandomForestClassifier(n_jobs=1, class_weight='balanced', max_depth=10)
# Define boruta selection method
boruta_selector = BorutaPy(
rf,
n_estimators = 'auto',
verbose = 2,
random_state = 1,
alpha = 0.01,
max_iter = 50 # Number of iterations, in case you want a more 'aggressive' approach
)
# Convert the training data to NumPy arrays and run Boruta to find all relevant features
x_train2 = np.array(x_train)
y_train2 = np.array(y_train)
boruta_selector.fit(x_train2, y_train2)
print(f"Ranking: {boruta_selector.ranking_}")
print(f"Nr. of significant features: {boruta_selector.n_features_}")
The process prints a report like this in the console, showing the result of each iteration and the number of relevant features selected.
Save the training and test data
Use the following command to save the results of feature selection with Boruta.
# Boolean mask of the features Boruta confirmed as relevant
feature_index = boruta_selector.support_
# Save training data
selected_train = pd.concat([x_train.loc[:, feature_index], y_train], axis=1)
selected_train.to_csv(r"....Your Path\Data\2. selected_features_train_data_boruta.csv", index=False)
# Save test data
selected_test = pd.concat([x_test.loc[:, feature_index], y_test], axis=1)
selected_test.to_csv(r"...Your Path\Data\3. selected_features_test_data_boruta.csv", index=False)
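To see which bands Boruta confirmed and which it left as tentative, you can map its boolean masks back to the column names. This is a small optional check and assumes x_train still holds the original band columns in the same order as the arrays passed to fit.
# Map Boruta's masks back to the band names
confirmed = x_train.columns[boruta_selector.support_]
tentative = x_train.columns[boruta_selector.support_weak_]
print(f"Confirmed features: {list(confirmed)}")
print(f"Tentative features: {list(tentative)}")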
Feature selection with RFE
The steps are similar to Boruta; the difference lies in the configuration. Here I use RFECV, the cross-validated variant of RFE from Scikit-learn, and keep using the training and test data that were already split above. We will also print the names and the number of the selected features in the console.
The description of the RFE configuration used is as follows:
- estimator: just like in Boruta, RandomForestClassifier is used as the base model.
- step: the number of features to remove at each iteration.
- cv: StratifiedKFold with 5 folds for cross-validation.
- scoring: the metric used to evaluate the model’s performance, in this case ‘accuracy’.
- min_features_to_select: the minimum number of features to keep.
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
# Configure RFE selector
selector = RFECV(
estimator=RandomForestClassifier(n_jobs=1, class_weight='balanced', max_depth=10),
step = 1 ,
cv = StratifiedKFold(5),
scoring='accuracy',
min_features_to_select=10,
n_jobs = 1
)
selector = selector.fit(x_train, y_train)
# Get selected feature names and number
feature_names = selector.get_feature_names_out()
feature_number = selector.n_features_
# Print feature name and number in the console
print(f"Selected features {feature_names}")
print(f"Number of selected features {feature_number}")
The result will look like this. Compare it with the training data that hasn’t gone through feature selection. Pay attention to whether any Sentinel-2 bands were eliminated during this process.
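RFECV also stores the cross-validation score for each number of features it evaluated, which makes it easy to inspect how accuracy changes as features are removed. This assumes a scikit-learn version (1.0 or newer) that exposes cv_results_.
# Mean cross-validated accuracy for each number of selected features
cv_scores = selector.cv_results_["mean_test_score"]
for n, score in enumerate(cv_scores, start=selector.min_features_to_select):
    print(f"{n} features: mean CV accuracy = {score:.3f}")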
Save the training and test data
Use the following command to save the results of feature selection with RFE.
# Save filtered training data
pd.concat([x_train[feature_names], y_train], axis=1).to_csv(
r"...Your Path\Data\2. selected_features_train_data_rfe.csv",
index=False
)
# Save filtered test data
pd.concat([x_test[feature_names], y_test], axis=1).to_csv(
r"...Your Path\Data\3. selected_features_test_data_rfe.csv",
index=False
)
To make sure…
Both Boruta and RFE will generate this kind of CSV or dataframe.
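To double-check, you can reload both saved training CSVs and compare their columns with the original band list; a quick sketch using the file names from the steps above.
# Reload the selected-feature training CSVs and compare which bands each method kept
boruta_cols = pd.read_csv(r"....Your Path\Data\2. selected_features_train_data_boruta.csv").columns
rfe_cols = pd.read_csv(r"...Your Path\Data\2. selected_features_train_data_rfe.csv").columns
print(f"Boruta kept: {list(boruta_cols)}")
print(f"RFE kept: {list(rfe_cols)}")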
References
[1] Li, J., Cheng, K., Wang, S., Morstatter, F., Trevino, R. P., Tang, J., & Liu, H. (2017). Feature selection: A data perspective. ACM Computing Surveys, 50(6), 1–45. https://doi.org/10.1145/3136625
[2] Venkatesh, B., & Anuradha, J. (2019). A review of feature selection and its methods. Cybernetics and Information Technologies, 19(1), 3–26. https://doi.org/10.2478/cait-2019-0001
[3] https://pygis.io/docs/f_rs_ml_predict.html
[4] https://www.learndatasci.com/tutorials/geospatial-data-python-geopandas-shapely/
[5] https://www.machinelearningplus.com/machine-learning/feature-selection/
[6] https://blog.kxy.ai/effective-feature-selection/index.html
[7] scikit-learn-contrib/boruta_py: Python implementations of the Boruta all-relevant feature selection method. https://github.com/scikit-learn-contrib/boruta_py
[8] https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html
Connect with me on LinkedIn