The objective of this project is to develop a model that predicts credit card approval from applicant data. By analyzing applicant characteristics, we aim to identify the key factors that influence approval decisions. We will prioritize model interpretability, both to understand the importance of individual variables and to examine potential biases related to age, gender, or other demographic factors. Mitigating any such biases is essential to developing a fairer model.
The dataset contains both categorical and numerical variables, so it requires preprocessing before modeling. Specifically, we will handle missing values, encode categorical features, and scale numerical variables so that features on very different scales are comparable.
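The three preprocessing steps named above can be sketched with pandas alone. This is a minimal illustration under stated assumptions, not the project's final pipeline: the `preprocess` helper, the `"Unknown"` fill value, and the toy rows are all hypothetical, and z-score scaling is only one of several reasonable scaling choices.

```python
import pandas as pd

# Toy frame mimicking a few column types; the values are made up for illustration.
toy = pd.DataFrame({
    "OCCUPATION_TYPE": ["Sales staff", None, "Security staff", None],
    "CODE_GENDER": ["F", "M", "M", "F"],
    "AMT_INCOME_TOTAL": [270000.0, 427500.0, 112500.0, 90000.0],
})

def preprocess(df, categorical_cols, numerical_cols):
    """Fill missing categories, z-score numerical columns, one-hot encode (hypothetical helper)."""
    out = df.copy()
    # 1. Missing values: keep "missing" as its own explicit category.
    out[categorical_cols] = out[categorical_cols].fillna("Unknown")
    # 2. Scaling: standardize to zero mean and unit (population) variance.
    for col in numerical_cols:
        std = out[col].std(ddof=0)
        out[col] = (out[col] - out[col].mean()) / std if std > 0 else 0.0
    # 3. Encoding: one indicator column per category level.
    return pd.get_dummies(out, columns=categorical_cols, dtype=int)

processed = preprocess(toy, ["OCCUPATION_TYPE", "CODE_GENDER"], ["AMT_INCOME_TOTAL"])
print(processed.head())
```

When a model is trained, the scaling statistics would normally be computed on the training split only and then reused on the test split, to avoid leaking information between the two.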
Below is a sample code snippet to load the dataset and preview the data:
# Import necessary libraries
import pandas as pd
# Load the dataset
application_record = pd.read_csv('/Users/shats/Documents/VSCode/everything/STAT451/Project/application_record.csv')
credit_record = pd.read_csv('/Users/shats/Documents/VSCode/everything/STAT451/Project/credit_record.csv')
application_record.head()
| | ID | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | DAYS_BIRTH | DAYS_EMPLOYED | FLAG_MOBIL | FLAG_WORK_PHONE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5008804 | M | Y | Y | 0 | 427500.0 | Working | Higher education | Civil marriage | Rented apartment | -12005 | -4542 | 1 | 1 | 0 | 0 | NaN | 2.0 |
1 | 5008805 | M | Y | Y | 0 | 427500.0 | Working | Higher education | Civil marriage | Rented apartment | -12005 | -4542 | 1 | 1 | 0 | 0 | NaN | 2.0 |
2 | 5008806 | M | Y | Y | 0 | 112500.0 | Working | Secondary / secondary special | Married | House / apartment | -21474 | -1134 | 1 | 0 | 0 | 0 | Security staff | 2.0 |
3 | 5008808 | F | N | Y | 0 | 270000.0 | Commercial associate | Secondary / secondary special | Single / not married | House / apartment | -19110 | -3051 | 1 | 0 | 1 | 1 | Sales staff | 1.0 |
4 | 5008809 | F | N | Y | 0 | 270000.0 | Commercial associate | Secondary / secondary special | Single / not married | House / apartment | -19110 | -3051 | 1 | 0 | 1 | 1 | Sales staff | 1.0 |
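The `DAYS_BIRTH` and `DAYS_EMPLOYED` columns in the preview are negative, which suggests they count days backward from a reference date. Under that assumption (it is not confirmed by the output itself), they can be converted into more interpretable durations in years, which is useful when examining age-related bias. The column names `AGE_YEARS` and `YEARS_EMPLOYED` and the constant `DAYS_PER_YEAR` are hypothetical.

```python
import pandas as pd

# Values copied from three rows of the preview above.
preview = pd.DataFrame({
    "ID": [5008804, 5008806, 5008808],
    "DAYS_BIRTH": [-12005, -21474, -19110],
    "DAYS_EMPLOYED": [-4542, -1134, -3051],
})

DAYS_PER_YEAR = 365.25  # assumption: average year length, ignores exact calendar dates

# Negate the offsets so durations are positive, then convert days to years.
preview["AGE_YEARS"] = (-preview["DAYS_BIRTH"] / DAYS_PER_YEAR).round(1)
preview["YEARS_EMPLOYED"] = (-preview["DAYS_EMPLOYED"] / DAYS_PER_YEAR).round(1)
print(preview[["ID", "AGE_YEARS", "YEARS_EMPLOYED"]])
```

The sketch assumes every value is a negative offset; any other kind of value in these columns in the full dataset would need separate handling before conversion.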
credit_record.head()
| | ID | MONTHS_BALANCE | STATUS |
---|---|---|---|
0 | 5001711 | 0 | X |
1 | 5001711 | -1 | 0 |
2 | 5001711 | -2 | 0 |
3 | 5001711 | -3 | 0 |
4 | 5001712 | 0 | C |
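Because `credit_record` holds several monthly rows per `ID` while `application_record` holds one row per applicant, the two tables have to be brought to the same granularity before they can be combined. The following is a rough sketch of one way to do that on small stand-in frames; the summary column names (`n_months`, `earliest_month`, `n_distinct_status`) are hypothetical, and no interpretation of the `STATUS` codes is assumed.

```python
import pandas as pd

# Small stand-ins for the two tables; rows are illustrative, not real records.
applications = pd.DataFrame({
    "ID": [5001711, 5001712, 5008804],
    "CODE_GENDER": ["F", "M", "M"],
})
credit = pd.DataFrame({
    "ID": [5001711, 5001711, 5001711, 5001711, 5001712],
    "MONTHS_BALANCE": [0, -1, -2, -3, 0],
    "STATUS": ["X", "0", "0", "0", "C"],
})

# Collapse the monthly history to one row per applicant.
history = (
    credit.groupby("ID")
    .agg(
        n_months=("MONTHS_BALANCE", "size"),
        earliest_month=("MONTHS_BALANCE", "min"),
        n_distinct_status=("STATUS", "nunique"),
    )
    .reset_index()
)

# A left join keeps applicants without any credit history;
# the indicator column (_merge) marks them as "left_only".
merged = applications.merge(history, on="ID", how="left", indicator=True)
print(merged)
```

A left join is used here so that applicants without a credit history remain visible; an inner join would silently drop them, which may or may not be what the analysis needs.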
application_record.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 438557 entries, 0 to 438556
Data columns (total 18 columns):
 #   Column               Non-Null Count   Dtype
---  ------               --------------   -----
 0   ID                   438557 non-null  int64
 1   CODE_GENDER          438557 non-null  object
 2   FLAG_OWN_CAR         438557 non-null  object
 3   FLAG_OWN_REALTY      438557 non-null  object
 4   CNT_CHILDREN         438557 non-null  int64
 5   AMT_INCOME_TOTAL     438557 non-null  float64
 6   NAME_INCOME_TYPE     438557 non-null  object
 7   NAME_EDUCATION_TYPE  438557 non-null  object
 8   NAME_FAMILY_STATUS   438557 non-null  object
 9   NAME_HOUSING_TYPE    438557 non-null  object
 10  DAYS_BIRTH           438557 non-null  int64
 11  DAYS_EMPLOYED        438557 non-null  int64
 12  FLAG_MOBIL           438557 non-null  int64
 13  FLAG_WORK_PHONE      438557 non-null  int64
 14  FLAG_PHONE           438557 non-null  int64
 15  FLAG_EMAIL           438557 non-null  int64
 16  OCCUPATION_TYPE      304354 non-null  object
 17  CNT_FAM_MEMBERS      438557 non-null  float64
dtypes: float64(2), int64(8), object(8)
memory usage: 60.2+ MB
application_record.isnull().sum()
ID                          0
CODE_GENDER                 0
FLAG_OWN_CAR                0
FLAG_OWN_REALTY             0
CNT_CHILDREN                0
AMT_INCOME_TOTAL            0
NAME_INCOME_TYPE            0
NAME_EDUCATION_TYPE         0
NAME_FAMILY_STATUS          0
NAME_HOUSING_TYPE           0
DAYS_BIRTH                  0
DAYS_EMPLOYED               0
FLAG_MOBIL                  0
FLAG_WORK_PHONE             0
FLAG_PHONE                  0
FLAG_EMAIL                  0
OCCUPATION_TYPE        134203
CNT_FAM_MEMBERS             0
dtype: int64
credit_record.isnull().sum()
ID                0
MONTHS_BALANCE    0
STATUS            0
dtype: int64
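The raw counts are easier to judge as a share of all rows. The short calculation below reuses the numbers printed above (438557 rows, 134203 missing `OCCUPATION_TYPE` values); the variable names are illustrative only.

```python
import pandas as pd

# Counts taken from the info() and isnull().sum() outputs above.
n_rows = 438557
missing_counts = pd.Series({"OCCUPATION_TYPE": 134203, "CNT_FAM_MEMBERS": 0})

# Share of rows with a missing value, in percent.
missing_share = (missing_counts / n_rows * 100).round(1)
print(missing_share)  # OCCUPATION_TYPE is missing in roughly 30.6% of rows
```

On the full frame, `application_record.isnull().mean()` gives the same shares directly. With nearly a third of the occupation values missing, dropping those rows would discard a large part of the data, so treating the missing values as a separate category may be a safer starting point.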
# Create correlation matrix for numerical features
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
# Select numerical columns
numerical_cols = application_record.select_dtypes(include=['int64', 'float64']).columns
# Create correlation matrix
plt.figure(figsize=(12, 10))
# Exclude FLAG_MOBIL from the correlation matrix
correlation_matrix = application_record[numerical_cols].drop(columns=['FLAG_MOBIL']).corr()
# Create heatmap with customized appearance
sns.heatmap(correlation_matrix,
            annot=True,       # Show correlation values
            cmap='coolwarm',  # Color scheme
            center=0,         # Center the colormap at 0
            fmt='.2f',        # Round to 2 decimal places
            square=True,      # Make cells square
            linewidths=0.5)   # Add gridlines
plt.title('Correlation Matrix of Numerical Features', pad=20)
plt.tight_layout()
plt.show()
# Print strongest correlations (optional)
print("\nStrongest correlations:")
# Keep only the upper triangle (k=1 excludes the diagonal); all other cells become NaN
upper = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool))
# Flatten to a Series indexed by column pair and drop the masked (NaN) cells
strongest_correlations = upper.unstack().dropna()
# Sort by absolute value, strongest first
strongest_correlations = strongest_correlations.sort_values(key=abs, ascending=False)
print(strongest_correlations.head(10))
Strongest correlations:
CNT_FAM_MEMBERS  CNT_CHILDREN        0.884781
DAYS_EMPLOYED    DAYS_BIRTH         -0.617908
DAYS_BIRTH       CNT_CHILDREN        0.349088
CNT_FAM_MEMBERS  DAYS_BIRTH          0.306179
FLAG_PHONE       FLAG_WORK_PHONE     0.290066
DAYS_EMPLOYED    CNT_CHILDREN       -0.241535
CNT_FAM_MEMBERS  DAYS_EMPLOYED      -0.234373
FLAG_WORK_PHONE  DAYS_EMPLOYED      -0.232208
                 DAYS_BIRTH          0.171829
DAYS_EMPLOYED    AMT_INCOME_TOTAL   -0.141291
dtype: float64
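The strongest pair, `CNT_FAM_MEMBERS` and `CNT_CHILDREN` at about 0.88, suggests the two columns carry largely overlapping information. One way to surface such pairs systematically is to filter the upper triangle by a threshold. The sketch below does this on synthetic data; the `correlated_pairs` helper and the 0.8 threshold are hypothetical choices, and the toy columns only borrow names from the dataset.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr()
    # Keep only the strict upper triangle so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).unstack().dropna()
    pairs = pairs[pairs.abs() > threshold]
    return pairs.sort_values(key=abs, ascending=False)

# Synthetic data: family size is built from children plus one or two adults.
rng = np.random.default_rng(0)
children = rng.integers(0, 4, size=500)
toy = pd.DataFrame({
    "CNT_CHILDREN": children,
    "CNT_FAM_MEMBERS": children + rng.integers(1, 3, size=500),
    "AMT_INCOME_TOTAL": rng.normal(180000, 60000, size=500),
})

high = correlated_pairs(toy)
print(high)
```

Pairs flagged this way are candidates for dropping or combining one of the two features, which can make coefficient-based interpretations of a model easier to read.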