Building a Predictor for Credit Default Risk Evaluation

Machine Learning, Random Forest, XGBoost, Financial Data, Risk Modeling, Incremental Learning

Motivation
Exploratory Data Analysis
Feature Engineering
Hyperparameter Tuning
Results
Challenges
Impact

Motivation

Anyone who applies for a credit loan carries some probability of default (risk). I train machine learning models to estimate that probability and use it to classify whether a loan application should be approved.

Exploratory Data Analysis

The initial data analysis surfaced two main issues to address:

Features with missing values
Class imbalance in the target variable

Addressing Missing Values

I first identified the missingness mechanism (MCAR, MAR, or MNAR) with the help of the missingno Python library.
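A simple numeric check that can sit alongside the missingno plots is to compare the default rate between rows where a feature is missing and rows where it is present. A large gap hints that the missingness is related to the target, which would argue against MCAR. The sketch below uses pandas on a toy frame; the column names `income`, `age`, and `default` and the helper `missingness_report` are hypothetical and are not taken from the project data.

```python
import pandas as pd

# Toy data with hypothetical columns; not the project's dataset.
df = pd.DataFrame({
    "income":  [50.0, None, 42.0, None, 61.0, None, 38.0, 55.0],
    "age":     [34, 45, None, 29, 51, 40, 38, 47],
    "default": [0, 1, 0, 1, 0, 1, 0, 0],
})

def missingness_report(frame, target):
    """Per feature: share of missing values and default rate with/without a value."""
    rows = []
    for col in frame.columns.drop(target):
        is_missing = frame[col].isna()
        rows.append({
            "feature": col,
            "missing_share": is_missing.mean(),
            "default_rate_missing": frame.loc[is_missing, target].mean(),
            "default_rate_present": frame.loc[~is_missing, target].mean(),
        })
    return pd.DataFrame(rows).set_index("feature")

report = missingness_report(df, "default")
print(report)
```

In this toy frame every row with a missing `income` defaulted, which is the kind of pattern that would call for a closer look before imputing.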

Addressing Imbalanced classes

I compared model performance after balancing the target variable with SMOTE and with ADASYN (Adaptive Synthetic Sampling).
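To make the idea concrete: SMOTE creates synthetic minority samples by interpolating between a minority point and one of its k nearest minority neighbours, while ADASYN generates more samples around minority points that have many majority-class neighbours. The following is a minimal NumPy sketch of the SMOTE interpolation step only; the function name `smote_like_oversample` and the toy data are assumptions for illustration, and a real project would more likely rely on the `SMOTE` and `ADASYN` classes from imbalanced-learn.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances among minority samples only.
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]  # indices of the k nearest rows

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))          # random minority sample
        nb = neighbours[base, rng.integers(k)]   # one of its k nearest neighbours
        gap = rng.random()                       # position along the connecting segment
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Toy minority class with two features (k must be smaller than the number of rows).
X_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
X_new = smote_like_oversample(X_minority, n_new=6)
print(X_new.shape)  # (6, 2)
```

Oversampling of this kind should be applied to the training folds only, so that synthetic points never leak into validation data.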

Feature Engineering

I applied several feature selection methods to reduce computation time, the risk of overfitting, and the complexity of interpretation:

Recursive Feature Elimination (RFE)
Chi-squared (\( \chi^2 \)) test
Univariate feature selection: ANOVA F-value
Information Value (IV) and Weight of evidence (WoE)
Correlation
Threshold
Boruta algorithm
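Of the methods listed above, Weight of Evidence and Information Value are the most specific to credit scoring, so a small worked sketch may help. For each bin of a feature, WoE is the log ratio of the share of non-defaulters to the share of defaulters in that bin, and IV sums (share of non-defaulters minus share of defaulters) times WoE over all bins. The code below assumes the convention that 1 means default and uses a hypothetical `employment` feature; the function name `woe_iv` is also illustrative.

```python
import math
from collections import Counter

def woe_iv(categories, targets):
    """Return ({category: WoE}, IV) for a binary target where 1 = default."""
    good = Counter(c for c, t in zip(categories, targets) if t == 0)
    bad = Counter(c for c, t in zip(categories, targets) if t == 1)
    total_good, total_bad = sum(good.values()), sum(bad.values())

    woe, iv = {}, 0.0
    for cat in sorted(set(categories)):
        # Share of all non-defaulters / defaulters that fall into this bin.
        dist_good = good[cat] / total_good
        dist_bad = bad[cat] / total_bad
        if dist_good == 0 or dist_bad == 0:
            continue  # WoE is undefined here; real code would smooth counts or merge bins
        woe[cat] = math.log(dist_good / dist_bad)
        iv += (dist_good - dist_bad) * woe[cat]
    return woe, iv

# Toy example with a hypothetical "employment" feature.
employment = ["salaried"] * 6 + ["self_employed"] * 4
defaulted = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1]
woe, iv = woe_iv(employment, defaulted)
print(woe, round(iv, 3))
```

A positive WoE marks a bin where non-defaulters are over-represented, and a negative WoE marks one where defaulters are. Features are then usually ranked by IV, with very low values treated as uninformative.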

Hyperparameter Tuning

I applied Optuna, which uses a Bayesian optimization algorithm to sample hyperparameters.

Results

Evolution of the Precision-Recall AUC score during the training of an XGBoost model.

Precision-Recall Curve for the best XGBoost model.
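For readers unfamiliar with the metric, the area under a precision-recall curve is commonly summarised as average precision: the precision at each rank where a positive is retrieved, weighted by the recall gained at that rank. The pure-Python sketch below illustrates the calculation on made-up labels and scores; it assumes there are no tied scores and is not the code used to produce the figures.

```python
def average_precision(y_true, scores):
    """Sum of precision at each positive hit, weighted by the recall step 1 / n_pos."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    hits, ap = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            ap += (hits / rank) / n_pos  # precision at this rank times the recall gained
    return ap

# Made-up labels (1 = default) and model scores, already distinct.
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
ap = average_precision(y_true, scores)
print(round(ap, 4))  # 0.7222
```

On imbalanced data a useful reference point is that a classifier with no skill scores roughly the positive rate, so the same value can be strong or weak depending on how rare defaults are.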

Challenges

Impact

The model achieves a ROC AUC score of 0.66.