Feature Engineering: Feature Selection

The initial and fundamental stage of developing a GeoAI model

Wahyu Ramadhan
10 min read · Dec 2, 2024

Hello everyone,

A paradigm shift has occurred in geoscience branches like geographic information science and remote sensing. The focus has shifted from algorithmic approaches that rely mainly on deterministic models for geospatial analysis to the development of heuristics supported by, for example, Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL). This is my own conclusion, drawn from lectures and discussions with professors on campus.

This transition gives us flexibility in identifying patterns from data without having to follow a rigid set of rules. In this context, AI can develop and induce insights from data, not unlike human intuition in decision-making. While the results are rarely 100% perfect, there is usually a justification that explains the imperfections. And that is exactly where the beauty lies.

I might start a new theme on this blog around Geospatial Artificial Intelligence (GeoAI). I don't think it will be a series like #BelajarGEE. Besides being easy to adapt, GEE can be integrated with GeoAI through Python, as shown in content from my friend Yakub Hariana. As is commonly done, data processing should begin with pre-processing, and one important step there is Feature Engineering.

In today's era of big data, one aspect of feature engineering, namely feature selection, can greatly assist our work. When preparing high-dimensional data for GeoAI, whether for Machine Learning (ML) or Deep Learning (DL), feature selection helps reduce noise and eliminate redundant and irrelevant features, which decreases computational load and improves model accuracy. The assumption is that by processing only the most relevant features, the AI model becomes better at interpretation and prediction while avoiding overfitting [1] [2].

To give a brief example, suppose we want to perform multispectral classification using an image with 12 bands. Not all features (in this case, the bands of the image) are necessarily significant for the predictions made by the AI model, so the irrelevant ones are removed to improve the model's efficiency or accuracy. This aligns with what the American mathematician Richard Ernest Bellman called the Curse of Dimensionality in 1961: the amount of data needed to cover the feature space grows exponentially with the number of dimensions. We are experiencing its effects now due to the exponential growth of data, and I believe the compromise lies in managing data more efficiently.
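To make the curse concrete, here is a toy sketch (my own illustration, not part of the workflow below): covering each feature's value range with just 10 bins requires exponentially more cells as bands are added, so a fixed number of training pixels covers the space ever more sparsely.

# A toy illustration of the curse of dimensionality: the number of cells
# needed to cover the feature space at a fixed resolution (10 bins per
# band) grows exponentially with the number of bands
for n_bands in [1, 2, 5, 12]:
    print(f"{n_bands} band(s): {10 ** n_bands:,} cells at 10 bins per band")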

Feature selection (source)

Preparing Training Samples

Download the Required Materials

The materials I use for performing feature selection are twofold: raster imagery and training samples for labeling the raster data. To make it easier, I’m using data from my previous content. Download the materials below:

  1. Sentinel-2 data
  2. Training sample for labeling the data

Convert Training Samples to Raster

This step can be done using QGIS. Here's how (a scripted Python alternative follows the figure below):

  1. Convert the projection system of the vector-format training sample (mine has a .geojson extension) to UTM.
  2. Rasterize it using the settings shown below.
  3. Reproject the raster back to the geographic coordinate system.

Rasterize the training sample
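If you prefer to script this step instead of clicking through QGIS, here is a rough Python sketch using geopandas and rasterio. The input file name, the EPSG code (32748 is UTM zone 48S, which covers Lampung), and the 10 m resolution are my assumptions; adjust them to your own data. The 'lc' attribute holding the class numbers matches the column shown in the figure further below.

import geopandas as gpd
import rasterio
from rasterio import features
from rasterio.transform import from_origin

# Read the vector training sample and project it to UTM
# (EPSG:32748 = UTM zone 48S; adjust to your study area)
gdf = gpd.read_file(r"....Your Path\Data\training_sample.geojson").to_crs("EPSG:32748")

# Build a 10 m grid covering the sample extent
xmin, ymin, xmax, ymax = gdf.total_bounds
res = 10
width, height = int((xmax - xmin) / res), int((ymax - ymin) / res)
transform = from_origin(xmin, ymax, res, res)

# Burn the 'lc' class numbers into the grid; 0 marks nodata
burned = features.rasterize(
    ((geom, value) for geom, value in zip(gdf.geometry, gdf["lc"])),
    out_shape=(height, width),
    transform=transform,
    fill=0,
    dtype="uint8",
)

with rasterio.open(
    r"....Your Path\Data\training_sample_utm.tif", "w",
    driver="GTiff", height=height, width=width, count=1,
    dtype="uint8", crs=gdf.crs, transform=transform, nodata=0,
) as dst:
    dst.write(burned, 1)

Reprojecting the result back to a geographic CRS can then be done with rasterio.warp.reproject, the same call used in the conversion script later in this post.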

The training samples used represent land cover classes, with each class labeled by a number from 1 to 7, corresponding to the following object names in order: water, built-up land, agricultural vegetation, vegetation, mangrove, low vegetation, and clouds. When converted to raster format, these numbers change and are recognized as raster values.

Class labels (left) represented by class numbers in the ‘lc’ column (right)

Preparing the Anaconda environment

We will use Python to process the data from here on. There are no restrictions on the Integrated Development Environment (IDE): it can be cloud-based, like Google Colab or Kaggle, or desktop-based, like Jupyter Notebook, PyCharm, or Visual Studio Code (VSCode) with the Jupyter and Python extensions. So far, however, I feel most comfortable using VSCode.

I use Anaconda because it comes with many pre-installed Python libraries and packages. After installing Anaconda, the next step is to create a virtual environment to avoid conflicts between different versions of libraries and packages. To create and activate one, open the Anaconda Prompt (on Windows, search for 'Anaconda Prompt' in the Start menu) and enter the following commands to create a virtual environment named 'geo':

conda create -n geo python=3.12
conda activate geo
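While the environment is active, you can also install the libraries used in the rest of this post from the conda-forge channel (package names here are my assumption; boruta_py is the conda-forge name of the Boruta package):

conda install -c conda-forge rasterio pandas scikit-learn boruta_py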

In VSCode, press Ctrl+Shift+P, choose 'Python: Select Interpreter', and select the previously created virtual environment.

Select the Python environment and install the Jupyter extension in VSCode

Create a Notebook File and Organize the Folder Structure

Place the training sample and image files in the ‘Data’ folder, which should be inside the main project folder. Then, create a Python notebook file with the .ipynb extension inside it. To do this, right-click on the project folder and select ‘New File’.

Folder structure and Python notebook file creation

Convert training sample into CSV

We need to convert the training samples into a tabular format, since most Python libraries work with data in this form. In Python, tabular data with column headers is handled as a dataframe, similar to the attribute table of vector data, while raster data is represented as a matrix or array (2D or multidimensional). Why is this the case? Because Python libraries, including ML libraries such as Scikit-learn, TensorFlow, and PyTorch, are designed around these structures [3] [4]. The most common libraries here are Numpy for arrays, Pandas for dataframes, and Geopandas for geospatial dataframes.

Python array and dataframe (left) (source); types of array dimensions (right): multispectral data corresponds to a multidimensional array (source)
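To make that concrete, here is a minimal toy sketch (my own example) of turning a multiband array into the pixels-by-bands table that ML libraries expect; this is exactly the reshape the conversion script below performs:

import numpy as np
import pandas as pd

# A tiny 3-band "image" of 2x2 pixels: shape (bands, rows, cols)
image = np.arange(12).reshape(3, 2, 2)

# Flatten each band and transpose so each row is one pixel and each
# column one band - the tabular layout expected by ML libraries
df = pd.DataFrame(image.reshape(3, -1).T, columns=["B2", "B3", "B4"])
print(df)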

Python libraries used:

  1. rasterio: Reads and writes geospatial raster data, including resampling (changing spatial resolution) and reprojection.
  2. numpy: Creates and manipulates Python arrays.
  3. pandas: Manipulates Python dataframes.

import rasterio
from rasterio.enums import Resampling
from rasterio.warp import reproject
import numpy as np
import pandas as pd

def training_sample(raster_path, roi_path, output_csv_path):
    # Load raster data
    raster = rasterio.open(raster_path)

    # Load training sample
    roi = rasterio.open(roi_path)

    # Allocate an array on the raster's grid to hold the resampled roi;
    # it must match the raster's shape exactly for the masking below
    roi_resampled = np.empty(shape=(raster.height, raster.width), dtype=roi.dtypes[0])

    # Reproject the roi onto the raster grid (nearest neighbour keeps
    # the class values intact)
    reproject(
        source=rasterio.band(roi, 1),
        destination=roi_resampled,
        src_transform=roi.transform,
        src_crs=roi.crs,
        dst_transform=raster.transform,
        dst_crs=raster.crs,
        resampling=Resampling.nearest
    )

    # Create a mask from the resampled roi (True where a training pixel exists)
    roi_mask = roi_resampled != roi.nodata

    # Apply the mask to the raster data; unlabeled pixels become NaN
    raster_data = raster.read()
    raster_masked = np.where(roi_mask, raster_data, np.nan)

    # Stack the class layer on top of the masked bands
    stacked_data = np.vstack([roi_resampled[np.newaxis, ...], raster_masked])

    # Convert to DataFrame and set the class and band names (the band
    # names depend on the raster you use - this case is Sentinel-2)
    df = pd.DataFrame(
        stacked_data.reshape(stacked_data.shape[0], -1).T,
        columns=["class", "B2", "B3", "B4", "B5", "B6", "B7",
                 "B8", "B8A", "B9", "B11", "B12"])

    # Keep only pixels labeled with one of the seven land cover classes
    valid_classes = [1, 2, 3, 4, 5, 6, 7]
    df = df[df['class'].isin(valid_classes)]

    # Convert the 'class' column to integer
    df['class'] = df['class'].astype(int)

    # Remove rows with missing values
    df_clean = df.dropna()

    # Save to CSV
    df_clean.to_csv(output_csv_path, index=False)

    print(f"CSV file created successfully! {df_clean}")

# Run the function
training_sample(
    raster_path=r"....Your Path\Data\s2_median.tif",
    roi_path=r"....Your Path\Data\Training sample Lampung_gcs.tif",
    output_csv_path=r"....Your Path\Data\1. training_data_new2.csv"
)

The output CSV will have a column named class representing the land cover class, followed by the names of the bands from Sentinel-2.
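Before moving on, it doesn't hurt to sanity-check the table. A quick sketch (the path follows the example above) prints the table size and the pixel count per class:

import pandas as pd

# Inspect the exported training table: shape and pixels per class
df = pd.read_csv(r"....Your Path\Data\1. training_data_new2.csv")
print(df.shape)
print(df["class"].value_counts().sort_index())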

CSV output

Feature selection process

There are several methods of feature selection; this time, I'll try two of the most common: Boruta and Recursive Feature Elimination (RFE). Both are considered efficient, flexible to integrate with various algorithms, and helpful in reducing overfitting. Of course, there are trade-offs: Boruta retains all features that are likely relevant but is quite resource-intensive, while RFE is lighter but may remove important features [5] [6].

Import the Boruta library

Let's start with Boruta. To import the library, use the following commands. If it is not installed on your machine, open the Anaconda Prompt and install it, for example with conda install -c conda-forge boruta_py (or pip install Boruta). In this step, we import Boruta and RandomForestClassifier from Scikit-learn.

import sklearn
import boruta

from sklearn.ensemble import RandomForestClassifier

Load the training sample (.csv) that was created previously.

sample_path = r"....Your Path\Data\1. training_data_new2.csv"

df = pd.read_csv(sample_path)
print(f"Training Data Prop\n{df.head()}")
df.info()  # note: info() with parentheses; it prints the summary itself

A quick tip: to get a file's path in VSCode, right-click the file in question and select 'Copy Path'.

Quickly get file directory in the VSCode

Split data into training and test sets

As is common in machine learning, we need to split the sample data into two sets: one for training and one for testing. Here’s a brief explanation:

  • x = df.iloc[:, 1:]: selects every column except the first, i.e. the spectral bands used as features.
  • y = df.iloc[:, 0]: the first column ('class') is used as the target.

from sklearn.model_selection import train_test_split

x = df.iloc[:, 1:]
y = df.iloc[:, 0]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
print(f"X train {x_train}")
print(f"X test {x_test}")
print(f"Y test {y_test}")
print(f"Y train {y_train}")

Feature Selection with Boruta

This step involves configuring feature selection with Boruta. I want to be a bit more aggressive by adjusting some parameters, such as:

  • max_depth = 10: A larger tree depth, with the assumption that it can capture more complex patterns.
  • alpha = 0.01: Tightening the selection criteria, meaning only the most significant features will be selected. The risk is that some features, which may be relevant, might be overlooked due to the stricter criteria.
  • max_iter = 50: The maximum number of iterations in the feature selection process.

This aggressive configuration is not always necessary, considering its consequences, such as increased computational load and the risk of overfitting.

from boruta import BorutaPy

# Shim for newer NumPy versions: BorutaPy still references the
# deprecated np.int / np.float / np.bool aliases
np.int = np.int32
np.float = np.float64
np.bool = np.bool_

# max_depth: a deeper tree might capture more complex patterns,
# but it could also lead to overfitting
rf = RandomForestClassifier(n_jobs=1, class_weight='balanced', max_depth=10)

# Define the Boruta selection method
boruta_selector = BorutaPy(
    rf,
    n_estimators='auto',
    verbose=2,
    random_state=1,
    alpha=0.01,
    max_iter=50  # number of iterations, in case you want a more 'aggressive' approach
)

# Boruta expects numpy arrays, so convert the training data
x_train2 = np.array(x_train)
y_train2 = np.array(y_train)

boruta_selector.fit(x_train2, y_train2)

print(f"Ranking: {boruta_selector.ranking_}")
print(f"Nr. of significant features: {boruta_selector.n_features_}")

The process will print a report like this to the console, showing the result of each iteration and the number of relevant features selected.

Feature selection report with Boruta

Save the training and test data

Use the following command to save the results of feature selection with Boruta.

# Boolean mask of the features Boruta confirmed as relevant
feature_index = boruta_selector.support_

# Save training data
selected_train = pd.concat([x_train.loc[:, feature_index], y_train], axis=1)
selected_train.to_csv(r"....Your Path\Data\2. selected_features_train_data_boruta.csv", index=False)

# Save test data
selected_test = pd.concat([x_test.loc[:, feature_index], y_test], axis=1)
selected_test.to_csv(r"...Your Path\Data\3. selected_features_test_data_boruta.csv", index=False)

Feature selection with RFE

The steps are similar to Boruta; the difference lies in the configuration. Strictly speaking, we use RFECV, the cross-validated variant of RFE from Scikit-learn. Keep using the training and test data that were already split for the configuration below. We will also print the names and the number of the selected features to the console.

The description of the RFE configuration used is as follows:

  • estimator: Just like in Boruta, RandomForestClassifier is used as the base model.
  • step: The number of features to remove at each iteration.
  • cv (5): Use StratifiedKFold with 5 folds for cross-validation.
  • scoring: The metric used to evaluate the model’s performance, in this case, is ‘accuracy’.
  • min_features_to_select: The minimum number of features to select.

from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

# Configure the RFECV selector
selector = RFECV(
    estimator=RandomForestClassifier(n_jobs=1, class_weight='balanced', max_depth=10),
    step=1,
    cv=StratifiedKFold(5),
    scoring='accuracy',
    min_features_to_select=10,
    n_jobs=1
)

selector = selector.fit(x_train, y_train)

# Get the selected feature names and their number
feature_names = selector.get_feature_names_out()
feature_number = selector.n_features_

# Print the feature names and number to the console
print(f"Selected features {feature_names}")
print(f"Number of selected features {feature_number}")

The result will look like this. Compare it with the training data that hasn't gone through feature selection, and note whether any Sentinel-2 bands were eliminated in the process.

Sentinel-2 features considered relevant by RFE

Save the training and test data

Use the following command to save the results of feature selection with RFE.

# Save filtered training data
pd.concat([x_train[feature_names], y_train], axis=1).to_csv(
    r"...Your Path\Data\2. selected_features_train_data_rfe.csv",
    index=False
)

# Save filtered test data
pd.concat([x_test[feature_names], y_test], axis=1).to_csv(
    r"...Your Path\Data\3. selected_features_test_data_rfe.csv",
    index=False
)

To make sure…

Both Boruta and RFE will generate this kind of CSV or dataframe.

RFE feature selection result
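As a final check, you can train the same RandomForest on each selected set and compare test accuracy. A quick sketch using the Boruta CSVs saved above (swap in the RFE files to compare the two methods):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import pandas as pd

# Train on the Boruta-selected features, score on the held-out test set
train = pd.read_csv(r"....Your Path\Data\2. selected_features_train_data_boruta.csv")
test = pd.read_csv(r"...Your Path\Data\3. selected_features_test_data_boruta.csv")

rf = RandomForestClassifier(class_weight="balanced", max_depth=10, random_state=42)
rf.fit(train.drop(columns="class"), train["class"])
pred = rf.predict(test.drop(columns="class"))
print(f"Test accuracy: {accuracy_score(test['class'], pred):.3f}")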

Connect with me on LinkedIn
