STAT 451 Project Proposal

Predicting Loan Approval Based on Applicant Information

Benjamin Broide, Dominic Unterriker, Hae Seung Pyun, Kang Wei Fong, Youngwoo Kim

In [3]:
import pandas as pd
In [4]:
# Load dataset
df = pd.read_csv("/Users/kimyoungwoo/Downloads/loan_approval_dataset.csv")

# Preview dataset
print(df.head())
   loan_id   no_of_dependents      education  self_employed   income_annum  \
0        1                  2       Graduate             No        9600000   
1        2                  0   Not Graduate            Yes        4100000   
2        3                  3       Graduate             No        9100000   
3        4                  3       Graduate             No        8200000   
4        5                  5   Not Graduate            Yes        9800000   

    loan_amount   loan_term   cibil_score   residential_assets_value  \
0      29900000          12           778                    2400000   
1      12200000           8           417                    2700000   
2      29700000          20           506                    7100000   
3      30700000           8           467                   18200000   
4      24200000          20           382                   12400000   

    commercial_assets_value   luxury_assets_value   bank_asset_value  \
0                  17600000              22700000            8000000   
1                   2200000               8800000            3300000   
2                   4500000              33300000           12800000   
3                   3300000              23300000            7900000   
4                   8200000              29400000            5000000   

   loan_status  
0     Approved  
1     Rejected  
2     Rejected  
3     Rejected  
4     Rejected  

Research question

Can we predict whether a loan will be approved based on the borrower's financial and demographic information?

Variable Description

  • income_annum (numeric): Client's annual income
  • loan_amount (numeric): Amount of money requested for the loan
  • bank_asset_value (numeric): Total value of the client's bank assets. The luxury, residential, and commercial asset columns are defined analogously, as is total assets.
  • cibil_score (numeric): Client's credit score
  • self_employed (categorical): A binary variable. 1 if the client is self-employed, 0 if not (Yes/No in the raw data)
  • education (categorical): A binary variable. 1 if the client has graduated from college, 0 if not (Graduate/Not Graduate in the raw data)
  • loan_status (categorical): The target variable. 1 if the loan was approved, 0 if not (Approved/Rejected in the raw data)
  • Credit ratio (numeric, derived): Ratio of the client's loan amount to total assets
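
As a rough illustration of how these variables could be derived from the raw columns, the sketch below retypes the five previewed rows by hand, encodes the three categorical columns as 0/1, and computes the credit ratio. It assumes that total assets is the sum of the four asset columns, which the proposal does not state. The names `raw`, `encode`, `total_assets`, and `credit_ratio` are illustrative, not part of the dataset.

```python
import pandas as pd

# The five rows from the preview above, retyped by hand (subset of columns).
raw = pd.DataFrame({
    "education": ["Graduate", "Not Graduate", "Graduate", "Graduate", "Not Graduate"],
    "self_employed": ["No", "Yes", "No", "No", "Yes"],
    "loan_amount": [29900000, 12200000, 29700000, 30700000, 24200000],
    "residential_assets_value": [2400000, 2700000, 7100000, 18200000, 12400000],
    "commercial_assets_value": [17600000, 2200000, 4500000, 3300000, 8200000],
    "luxury_assets_value": [22700000, 8800000, 33300000, 23300000, 29400000],
    "bank_asset_value": [8000000, 3300000, 12800000, 7900000, 5000000],
    "loan_status": ["Approved", "Rejected", "Rejected", "Rejected", "Rejected"],
})

ASSET_COLS = [
    "residential_assets_value",
    "commercial_assets_value",
    "luxury_assets_value",
    "bank_asset_value",
]

def encode(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with 0/1 categorical columns and a credit_ratio column."""
    out = frame.copy()
    # Map each categorical column to 1 for its "positive" label, 0 otherwise.
    # Stripping whitespace is a precaution in case the CSV values are padded.
    positive_label = {
        "education": "Graduate",
        "self_employed": "Yes",
        "loan_status": "Approved",
    }
    for col, label in positive_label.items():
        out[col] = (out[col].str.strip() == label).astype(int)
    # ASSUMPTION: total assets = sum of the four asset columns.
    out["total_assets"] = out[ASSET_COLS].sum(axis=1)
    out["credit_ratio"] = out["loan_amount"] / out["total_assets"]
    return out

features = encode(raw)
print(features[["education", "self_employed", "loan_status", "credit_ratio"]])
```

On the full dataset, the same function could be applied to `df` once its column names have been checked against the ones used here.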

Methods

  • Before working with the data, we need to clean it and perform some feature engineering. We will drop two columns, loan_term and loan_id, bin all the numeric variables, and convert the education and self_employed columns to binary indicators.
  • We plan to create new variables by transforming the existing ones, which we believe will help us build a better prediction model. One example is the credit ratio, which is the ratio of the client's loan amount to total assets.
  • We plan to fit logistic regression, decision tree, KNN, and SVM classifiers and compare them to determine which best predicts loan approval status.
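
One way the planned comparison might look is sketched below with scikit-learn. Several details are assumptions rather than choices stated in this proposal: the data is a synthetic stand-in from `make_classification` (not the loan dataset), numeric features are binned into five equal-width bins, and models are ranked by mean 5-fold cross-validated accuracy. The helper `compare_models` and the hyperparameters shown are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, NOT the loan dataset: 300 rows, 6 numeric features.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=0
)

# The four model families named in the proposal (hyperparameters are guesses).
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="rbf"),
}

def compare_models(X, y, models, n_bins=5, cv=5):
    """Return {model name: mean cross-validated accuracy}."""
    scores = {}
    for name, model in models.items():
        # Binning sits inside the pipeline so bin edges are learned from the
        # training folds only, which avoids leaking test-fold information.
        pipeline = make_pipeline(
            KBinsDiscretizer(n_bins=n_bins, encode="ordinal", strategy="uniform"),
            model,
        )
        fold_scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")
        scores[name] = fold_scores.mean()
    return scores

scores = compare_models(X, y, candidates)
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: mean CV accuracy = {acc:.3f}")
```

Accuracy is used here only for simplicity. If approvals and rejections turn out to be imbalanced in the real data, a metric such as F1 or ROC AUC (both available through the `scoring` argument) may be a better basis for choosing a model.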