Exploring Pixels Beneath

The #30DayMapChallenge 2024 Recap

Wahyu Ramadhan
11 min read · Dec 31, 2024

Hello everyone,

Entering my fifth year of the #30DayMapChallenge, I realized that, as in 2023, I couldn't finish it flawlessly. I won't make many excuses; most of my ideas were redirected to other commitments. I wasn't sure about joining this year's challenge at first, but I decided to adopt a more opportunistic mindset and tackle several days at once. One thing remains certain, though: this challenge continues to excite me every November.

My concept was to develop a Geospatial Artificial Intelligence (GeoAI) model for benthic habitat classification. The effort turned out to be quite demanding, particularly the pre-processing. Although the model still falls under shallow (classical) machine learning, the whole workflow can be broken down into several pieces of content for the challenge.

For context, the #30DayMapChallenge was initiated by Topi Tjukanov in 2019 and takes place every November. This challenge invites participants to create a map each day, with each map based on a different theme. There are no restrictions on the methods or tools used, as long as the maps are shared on social media. You can find additional information on the official website here.

#30DayMapChallenge 2024 daily themes

By the way, the Google account where I stored the maps, images, and videos for previous years' recap content was unexpectedly banned for no clear reason. Re-uploading everything might take some time.

Benthic habitat

Benthic habitats are environments located at the bottom of aquatic systems, whether marine or freshwater, where various organisms interact with the substrate. Essentially, the condition of a benthic habitat is closely linked to the biodiversity it supports. In the field of geoscience, mapping benthic habitats has often been limited to objects visible in optically shallow waters [2][3]. As a result, features like coral reefs, seagrasses, algae, and sand found below the shelf edge or beyond the boundary of optically shallow water can be difficult to detect, especially when using remote sensing data.

Coral reef zonation (left) (source) and optically shallow water in remote sensing (satellite) imagery (right)

Pre-processing

Obtaining data labels

This stage aims to label the dataset that the GeoAI model will use to predict or classify benthic habitats in the imagery. The logical sequence in Land Use Land Cover (LULC) classification involves creating points, visiting those locations, and collecting samples (along with other required variables). However, the approach for benthic habitats is somewhat different.

Line transect (left) and field survey for taking benthic habitat samples (right) [1]

The uncertainty of aquatic conditions makes it challenging to precisely locate the reference points as planned, so we use line transects instead. Technically, in the field, this line serves as a sampling reference, either through photo transects or spot checks [1]. If a single class (e.g., seagrass) is found to dominate at a certain percentage in a sample point, that point is considered representative of that class. Alternatively, it can even be classified as a mixed class if the composition percentages meet specific criteria [9].

Benthic habitat survey using the photo transect method (personal documentation)

In terms of practicality and time efficiency in the field, the photo transect method seems to be a viable option. Using a camera and a handheld GPS, photos of the substrate are taken along the predefined line transect at regular intervals, such as every meter. Integration of the documentation with GPS data (coordinates) and calculation of class percentages for each sample point can be performed post-fieldwork.
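
That post-fieldwork integration can be scripted. Purely as a minimal sketch (not the exact field workflow), pairing each photo with the nearest GPS fix by timestamp could look like the following in pandas; the file names, column names, and 5-second tolerance are hypothetical.

import pandas as pd

# Hypothetical inputs: photo EXIF timestamps and the track log exported
# from the handheld GPS (both assumed to share the same time zone).
photos = pd.read_csv("photo_timestamps.csv", parse_dates=["time"])  # columns: photo_id, time
track = pd.read_csv("gps_track.csv", parse_dates=["time"])          # columns: time, lon, lat

# Both tables must be sorted by time for merge_asof
photos = photos.sort_values("time")
track = track.sort_values("time")

# Attach the nearest GPS fix (within 5 seconds) to each photo
matched = pd.merge_asof(
    photos, track, on="time",
    direction="nearest",
    tolerance=pd.Timedelta(seconds=5),
)

matched.to_csv("photo_points.csv", index=False)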

CPCe for calculating the percentage composition of benthic habitats at sample points; unfortunately, this tutorial does not cover integration with GPS data (coordinates)

Reference data

The Allen Coral Atlas (ACA) v2.0 is regarded as a reliable reference for labeling datasets, especially when field data is not available. The workflow simulates a field survey through the following steps:

  1. Create line transects.
  2. Convert the transects into points at 1-meter intervals.
  3. Extract pixel values, which include benthic habitat class codes and IDs.

*Use this GEE script to perform all three steps.

The transects should ideally be drawn perpendicular to the coastline — both seaward (vertical) and along the coastline (horizontal) — to capture enough class variation. Additionally, the lines should be extended to approach the reef slope. I would like to revise my earlier statement regarding samples needing to be polygons: due to the dynamic nature of field conditions, point samples are also acceptable. The important factor is to ensure a minimum number of pixels per class for representativeness (details here).
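
The GEE script linked above handles these steps in Earth Engine. Purely as an offline illustration of steps 1 and 2, a rough equivalent in Python with GeoPandas and Shapely might look like the sketch below, assuming the transects are stored in a projected CRS with metre units (the file and field names are hypothetical).

import geopandas as gpd

# Hypothetical input: line transects digitised over the reef,
# assumed to be in a projected CRS with metre units (e.g. UTM).
transects = gpd.read_file("transects.gpkg")

points = []
for _, row in transects.iterrows():
    line = row.geometry
    # Interpolate a point every 1 metre along the transect
    for dist in range(0, int(line.length) + 1):
        points.append({"transect_id": row.get("id"), "geometry": line.interpolate(dist)})

points_gdf = gpd.GeoDataFrame(points, geometry="geometry", crs=transects.crs)
points_gdf.to_file("transect_points.gpkg", driver="GPKG")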

Sampling benthic habitat classes in Google Earth Engine (GEE): Line transect (left), converting lines to points (center), and overlaying sample points with the ACA benthic habitat layer (right)

Shallow water selection

Currently, remote sensing sensors are effective at capturing underwater objects only within the visible wavelength range (400–700 nm). This means that the detection of benthic habitats is limited to a certain depth, which depends on water clarity and light penetration. Therefore, selecting areas that are primarily shallow water is crucial for more efficient analysis.

The reflectance curve for water typically declines after passing the red wavelength and fades completely as it approaches near-infrared.
Spectral reflectance curve [4]

I used the Normalized Difference Water Index (NDWI) approach to segment shallow water areas. A straightforward way to calculate NDWI in QGIS is the Raster Calculator with the formula (Green - NIR) / (Green + NIR). Since I used PlanetScope imagery (8 bands), Green = B4 and NIR = B8. I then applied a rule of thumb to reclassify the NDWI result: values greater than 0 are treated as optically shallow water. Finally, I converted the reclassified NDWI into vector format to mask the PlanetScope imagery.

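The same masking rule is straightforward to reproduce in Python. Below is a minimal sketch using rasterio and NumPy, assuming the 8-band PlanetScope layout described above; the file names and the choice to write the mask as a separate raster are placeholders, not the exact QGIS workflow.

import numpy as np
import rasterio

# Hypothetical file name; 8-band PlanetScope order assumed (Green = band 4, NIR = band 8)
with rasterio.open("planetscope.tif") as src:
    green = src.read(4).astype("float32")
    nir = src.read(8).astype("float32")
    profile = src.profile

# NDWI = (Green - NIR) / (Green + NIR); guard against division by zero
ndwi = np.where((green + nir) == 0, 0, (green - nir) / (green + nir))

# Rule of thumb used here: NDWI > 0 is treated as optically shallow water
water_mask = (ndwi > 0).astype("uint8")

profile.update(count=1, dtype="uint8", nodata=0)
with rasterio.open("shallow_water_mask.tif", "w", **profile) as dst:
    dst.write(water_mask, 1)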

Masking the optically shallow water on the PlanetScope data in QGIS

The data masked for shallow water is now ready for the next steps using Python.

Shallow waters masking results

Python scripts

The complete script is available in my GitHub repository here, as it is easier to update there compared to Medium. The process is divided into three Python files (.ipynb): one for preparing the training and testing dataset, another for building the machine learning (ML) model, and the last for applying the model to the PlanetScope data.

GitHub repository

Preparing the training and testing dataset

Although I have already included markdown notes in the .ipynb files within the repository, I will add some additional details that are not included there. This stage involves four main steps: describing the bands in the PlanetScope data, extracting pixel values, creating the training and testing dataset, and handling the class imbalance.

The step of adding a description to the data can technically be skipped if we are completely familiar with the structure or order of the bands. However, incorporating features like geomorphology (e.g., bathymetry) could complicate the process. To ensure clarity and consistency, it is best to keep this step in the workflow.

# Retrieve the list of band descriptions from the dataset
# (`dataset` and `planetscope` are assumed to be defined earlier in the notebook:
#  the opened PlanetScope raster and its transposed pixel array)
desc = dataset.descriptions

# If descriptions are not available, create manual descriptions
if not desc or all(d is None for d in desc):
    desc = ["B1", "B2", "B3", "B4", "B5", "B6", "B7", "B8"]

# Create a dictionary to associate band descriptions with their indices after transpose
band_descriptions = {f"Band {i+1}": desc[i] for i in range(len(desc))}

# Display the dimensions and descriptions for each band
print("\nDimensions of the raster data after transpose:", planetscope.shape)
print("Descriptions for each band after transpose:")
for band, description in band_descriptions.items():
    print(f'{band}: {description}')
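
The next two steps (extracting pixel values at the sample points and splitting them into training and testing sets) are covered in the repository notebooks; the sketch below only illustrates the idea. The file name, the class_code label column, the 70/30 split, and the assumption that the points share the raster's CRS are illustrative choices, not necessarily the repository's exact settings.

import numpy as np
import geopandas as gpd
import rasterio
from sklearn.model_selection import train_test_split

# Hypothetical inputs: labelled sample points (ACA class codes) in the raster's CRS,
# and the shallow-water-masked PlanetScope image.
points = gpd.read_file("transect_points_labelled.gpkg")
coords = [(geom.x, geom.y) for geom in points.geometry]

with rasterio.open("planetscope_shallow.tif") as src:
    # Extract the 8-band pixel values under each sample point
    X = np.array([vals for vals in src.sample(coords)])

y = points["class_code"].values  # hypothetical label column

# 70/30 split, stratified so every benthic class appears in both sets (ratio is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)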

The benthic habitat biodiversity in the Gili Islands shows a minimal presence of the Microalgal Mats class, resulting in a highly imbalanced sample distribution compared to other classes. This imbalance could potentially cause the GeoAI model to become biased or perform better at predicting the majority classes [5]. To address this, I applied random oversampling to increase the sample size of the minority class. For this, I used the RandomOverSampler module from the imblearn library to oversample the training dataset.

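A minimal sketch of that oversampling step, assuming X_train and y_train are the training features and labels produced in the previous step:

from imblearn.over_sampling import RandomOverSampler

# Randomly duplicate minority-class samples (e.g. Microalgal Mats)
# until every class in the training set has the same count.
ros = RandomOverSampler(random_state=42)
X_train_res, y_train_res = ros.fit_resample(X_train, y_train)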

Microalgal Mats class (left) and oversampling results for the minority class (right) [3].

Building the GeoAI model

The outline for building the GeoAI model includes importing the training and testing datasets, hyperparameter tuning, and model evaluation. The GeoAI model is constructed using the Extreme Gradient Boosting (XGBoost) ML algorithm, implemented via the XGBClassifier class from the xgboost library. While Random Forest, another tree-based algorithm, is a more common choice for benthic habitat classification, I was intrigued by claims of XGBoost's superior efficiency [6].

XGBoost uses a boosting approach, iteratively improving model performance by combining weak learners: each new learner is trained to correct the errors of the previous ones [7][8]. In contrast, Random Forest employs bagging, aggregating decision trees that are trained independently and in parallel [6].

The workflow of boosting-based ML algorithms that iteratively improve weak learners [7]
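
Before tuning, a base classifier has to be instantiated. The snippet below is only an indicative sketch of what the xgb_model used in the grid search could look like; the repository defines its own settings, and the scikit-learn wrapper infers the multi-class objective from the training labels during fitting.

from xgboost import XGBClassifier

# Base model to be tuned by the grid search below; settings here are illustrative.
# Class labels are assumed to be integer-encoded (0..n_classes-1).
xgb_model = XGBClassifier(random_state=42)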

Another critical step in building the GeoAI model is hyperparameter tuning. I used the grid search method to find the optimal combination of the available XGBoost hyperparameters. In addition, 5-Fold Cross-Validation (CV) was used for training and testing to assess performance: the model trains on 4 of the 5 folds (training data) and evaluates on the remaining fold (testing data), and this cycle repeats five times so that each fold serves as the testing set once. A complete list of XGBoost hyperparameters is available here.

from sklearn.model_selection import GridSearchCV

# Define the grid of hyperparameters to search during tuning
param_list = {
    'n_estimators': [50, 100, 300],  # Number of trees to build: trying 50, 100, and 300
    'max_depth': [3, 6, 9],          # Maximum depth of each tree: trying 3, 6, and 9
    'eta': [0.1, 0.3, 0.5],          # Learning rate (step size of the boosting process): trying 0.1, 0.3, and 0.5
    'gamma': [0, 0.1, 0.2],          # Minimum loss reduction required to further partition a leaf node: trying 0, 0.1, and 0.2
    'subsample': [0.8, 1.0],         # Fraction of the training data used to fit each tree: trying 0.8 and 1.0
    'colsample_bytree': [0.8, 1.0],  # Fraction of features used for each tree: trying 0.8 and 1.0
}

# 5-fold cross-validation GridSearchCV
grid_search = GridSearchCV(
    xgb_model,           # The model to tune (XGBoost classifier)
    param_list,          # The grid of hyperparameters to search
    cv=5,                # Number of folds for cross-validation (5-fold)
    scoring='accuracy',  # Metric used to evaluate model performance
    n_jobs=-1,           # Use all available CPU cores for parallel processing
    verbose=5            # Display progress (higher values give more detail)
)
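
Fitting the search and retrieving the winning configuration might then look like this (variable names follow the sketches above rather than the repository verbatim):

# Run the exhaustive search on the oversampled training data
grid_search.fit(X_train_res, y_train_res)

print("Best hyperparameters:", grid_search.best_params_)
print("Best CV accuracy:", grid_search.best_score_)

# Keep the refitted best estimator for evaluation on the held-out test set
best_model = grid_search.best_estimator_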

The classification evaluation results revealed an overall accuracy (OA) of only 60.19%, which is considered unsatisfactory. The Seagrass class demonstrated the best performance, achieving a Precision/User Accuracy (UA) of 76.93% and a Recall/Producer Accuracy (PA) of 77.63%. In contrast, the Microalgal Mats class showed significantly lower metrics, with a UA of 16% and a PA of 18.18%, despite the application of oversampling. This level of accuracy is understandable, given the inherent difficulty in distinguishing the spectral features of underwater objects, especially in the absence of pre-processing to enhance the quality of the image data.

Confusion matrix of benthic habitat classification in Gili Islands
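
For reference, the UA/PA values above correspond to per-class precision and recall on the held-out test set. A hedged sketch of how such numbers can be produced with scikit-learn, reusing the variable names from the earlier sketches:

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Predict on the untouched (non-oversampled) test set
y_pred = best_model.predict(X_test)

print("Overall accuracy:", accuracy_score(y_test, y_pred))
# Per-class precision = User Accuracy (UA), recall = Producer Accuracy (PA)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))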

Applying the model

There are no particular notes for the stage that applies the XGBoost model to the PlanetScope dataset, other than that the classification output consistently assigns deep water and land to the Rock class, even though I set the NoData (null) pixels to -9999 in the dataset-processing function. A temporary workaround is to mask the output with the shallow water polygon.
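
A rough sketch of that application step, assuming the tuned best_model from the grid search and the -9999 NoData value mentioned above (the output NoData code of -1 and the file names are placeholders):

import numpy as np
import rasterio

with rasterio.open("planetscope_shallow.tif") as src:
    img = src.read()          # shape: (bands, rows, cols)
    profile = src.profile

bands, rows, cols = img.shape
pixels = img.reshape(bands, -1).T        # shape: (n_pixels, n_bands)

# Pixels set to -9999 (NoData) during masking are excluded from prediction
nodata_mask = np.any(pixels == -9999, axis=1)

pred = np.full(pixels.shape[0], -1, dtype="int16")   # -1 = placeholder NoData class code
pred[~nodata_mask] = best_model.predict(pixels[~nodata_mask])

classified = pred.reshape(rows, cols)

profile.update(count=1, dtype="int16", nodata=-1)
with rasterio.open("benthic_classification.tif", "w", **profile) as dst:
    dst.write(classified, 1)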

Benthic habitat classification output in Gili Islands

Since map visualization in Python is less intuitive, I prefer to create the map layout elsewhere for more flexibility; as usual, I use QGIS and Figma.

Sample points, line transects, and NDWI (shallow water area)

I just noticed while making the map layout that there is a thin haze or clouds (?) in the shallow water area in the northern part of Gili Meno 😔

Benthic habitat classification map in Gili Islands

References

[1] Roelfsema, C. (2010). Integrating field data with high spatial resolution multispectral satellite imagery for calibration and validation of coral reef benthic community maps. Journal of Applied Remote Sensing, 4(1), 043527. https://doi.org/10.1117/1.3430107

[2] Hedley, J. D., Roelfsema, C., Brando, V., Giardino, C., Kutser, T., Phinn, S., Mumby, P. J., Barrilero, O., Laporte, J., & Koetz, B. (2018b). Coral Reef Applications of Sentinel-2: Coverage, characteristics, bathymetry and benthic mapping with comparison to Landsat 8. Remote Sensing of Environment, 216, 598–614. https://doi.org/10.1016/j.rse.2018.07.014

[3] Wicaksono, P., Aryaguna, P. A., & Lazuardi, W. (2019). Benthic habitat mapping model and cross-validation using machine-learning classification algorithms. Remote Sensing, 11(11), 1279. https://doi.org/10.3390/rs11111279

[4] O’Donohue, D. (2024, October 1). Understanding spectral reflectance in remote sensing. mapscaping.com. https://mapscaping.com/understanding-spectral-reflectance-in-remote-sensing/. Accessed 24 December 2024.

[5] Tang, T. (2023, April 25). Class imbalance strategies — A visual guide with code. Medium. https://towardsdatascience.com/class-imbalance-strategies-a-visual-guide-with-code-8bc8fae71e1a. Accessed 24 December 2024.

[6] Shao, Z., Ahmad, M. N., & Javed, A. (2024). Comparison of Random Forest and XGBoost Classifiers Using Integrated Optical and SAR Features for Mapping Urban Impervious Surface. Remote Sensing, 16(4), 665. https://doi.org/10.3390/rs16040665

[7] Chen, T., & Guestrin, C. (2016). XGBoost. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794. https://doi.org/10.1145/2939672.2939785

[8] GeeksforGeeks. (2023, February 6). XGBoost. https://www.geeksforgeeks.org/xgboost/. Accessed 24 December 2024.

[9] Wicaksono, P., & Harahap, S. D. (2023). Mapping seagrass biodiversity indicators of Pari Island using multiple worldview-2 bands derivatives. Geosfera Indonesia, 8(2), 189. https://doi.org/10.19184/geosi.v8i2.41214

Connect with me on LinkedIn
