Forest cover classification
Predict forest cover from cartographic features
Table of Contents
1. Introduction
2. Technology Used
3. Code Development
4. Results
5. GitHub Repository
1. Introduction
The project focuses on classifying forest cover types from cartographic variables such as elevation, slope, and soil type. The dataset, derived from U.S. Forest Service data, describes 30 m × 30 m patches of land, each labeled with one of seven forest cover types. The goal is to build a machine learning model that predicts the cover type from the available features, aiding environmental management and forestry practices.
This project was carried out in collaboration with Oier Huici Itarte.
For more information, you can visit the Kaggle competition page.
2. Technology Used
The project was developed in Python. Key libraries such as XGBoost, Scikit-learn, Pandas, and NumPy were used for machine learning, data manipulation, and numerical computation, providing the tools for building the classifier, preprocessing the data, and optimizing the model's performance.
3. Code Development
3.1 Data Preprocessing
The methodology used for data preprocessing involved:
- Removing outliers: the Isolation Forest algorithm from Scikit-learn was used to identify and remove outliers, reducing noise when training the classifier (see the sketch after this list).
- Handling class imbalance: No preprocessing was applied for class imbalance since the chosen algorithm, XGBoost, handles this internally.
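A minimal sketch of the outlier-removal step, assuming the features and labels live in a Pandas DataFrame and Series; the variable names and the contamination rate are illustrative, not taken from the project code:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

def remove_outliers(X_train: pd.DataFrame, y_train: pd.Series,
                    contamination: float = 0.02):
    """Drop the samples that Isolation Forest flags as outliers."""
    # contamination is an assumed value for illustration only.
    iso = IsolationForest(contamination=contamination, random_state=42)
    mask = iso.fit_predict(X_train) == 1  # fit_predict returns 1 (inlier) / -1 (outlier)
    return X_train[mask], y_train[mask]
```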
3.2 Algorithm Selection
An XGBoost classifier was trained on the training dataset to obtain, for each test sample, the probability of belonging to each class. Weighted scores were then calculated by multiplying these probabilities by the cost matrix, and the final class of each sample was chosen as the one with the lowest weighted score. This approach minimizes the expected cost of classification errors according to the penalties defined by the cost matrix.
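Below is a hedged sketch of this decision rule. The hyperparameter values and the unit-cost matrix are placeholders (the competition's actual cost matrix is not reproduced here), and X_train, y_train, and X_test are assumed to be the arrays prepared in section 3.1, with labels zero-indexed 0-6:

```python
import numpy as np
from xgboost import XGBClassifier

model = XGBClassifier(n_estimators=500, learning_rate=0.1)  # illustrative values
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)  # shape: (n_samples, 7)

# Placeholder cost matrix: cost_matrix[i, j] is the penalty for predicting
# class j when the true class is i (unit cost off-diagonal here).
cost_matrix = np.ones((7, 7)) - np.eye(7)

# Expected cost of each candidate prediction: probability-weighted penalties.
expected_cost = proba @ cost_matrix  # shape: (n_samples, 7)
y_pred = expected_cost.argmin(axis=1)  # the lowest weighted score wins
```

Note that with a unit-cost matrix this reduces to picking the most probable class; the rule only changes the predictions when some misclassifications are penalized more heavily than others.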
3.3 Hyperparameter Selection
For hyperparameter optimization, the GridSearchCV algorithm from Scikit-learn was used. This method performs an exhaustive sweep over candidate values for each hyperparameter to find the best-performing combination. Three-fold cross-validation was used, which splits the training dataset into three equal parts and, in each iteration, holds one part out as the validation set while training on the other two.
In our specific case, the classifier's hyperparameters optimized by the search were (a sketch follows this list):
- n_estimators
- learning_rate
- max_depth
- subsample
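A minimal sketch of the search, reusing the data from section 3.1; the candidate values are illustrative, not the grid actually swept in the project:

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Candidate values are assumptions for illustration only.
param_grid = {
    "n_estimators": [200, 500, 1000],
    "learning_rate": [0.05, 0.1, 0.3],
    "max_depth": [4, 6, 8],
    "subsample": [0.7, 0.9, 1.0],
}

search = GridSearchCV(XGBClassifier(), param_grid, cv=3, n_jobs=-1)
search.fit(X_train, y_train)  # X_train, y_train from section 3.1
print(search.best_params_)
```

With three-fold cross-validation, each of the 3 × 3 × 3 × 3 = 81 parameter combinations above is fitted three times, so n_jobs=-1 is useful to parallelize the sweep across all available cores.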
4. Results
The developed solution achieved first place in the Kaggle competition, producing the predictions with the lowest average cost of classification errors. Since submissions were scored by the penalties defined in the cost matrix, this result demonstrates the effectiveness of the cost-sensitive decision rule at optimizing for the real-world cost of misclassification.
5. GitHub Repository
You can find the code for this project on GitHub.