Will a borrower repay their loan in full? How can we classify whether a person will pay back their loan based on various features like FICO score, debt-to-income ratio, and interest rate?
Lenders earn money from borrowers who repay their loans and lose money on borrowers who do not repay in full. Our goal is therefore to predict, as accurately as possible, whether someone will repay their loan based on a range of variables. In practice, such a model could help determine whether a person's loan application should be approved or denied.
This dataset was found on Kaggle, but the data itself came from LendingClub.com where it is publicly available.
The dataset has lending data from 2007-2010 and includes 14 variables, but we will be using a subset of those variables for our purposes. We will be predicting the not.fully.paid variable for debt consolidation loans based on the variables int.rate, installment, log.annual.inc, dti, fico, days.with.cr.line, revol.bal, and delinq.2yrs, which we will describe below.
We will be using a decision tree classifier to model this problem. Our features will be int.rate, installment, log.annual.inc, dti, fico, days.with.cr.line, revol.bal, and delinq.2yrs. The y value that we are predicting is the not.fully.paid variable. At each node, we will choose the feature and threshold on which to split by minimizing the cost associated with the split. The best feature-threshold pair is the one that minimizes the weighted average entropy of the two child nodes the split produces.
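To make the split criterion concrete, here is a minimal sketch of how the weighted average entropy of a split could be computed for a single feature. The function names (`entropy`, `weighted_split_entropy`, `best_threshold`) are hypothetical helpers, the toy FICO scores and labels are made up for illustration and are not taken from the loan data, and using midpoints between consecutive distinct values as candidate thresholds is an assumption rather than something this notebook specifies.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def weighted_split_entropy(values, labels, threshold):
    """Weighted average entropy of the two children of the split value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)

def best_threshold(values, labels):
    """Return the (threshold, cost) pair with the lowest weighted entropy.

    Candidate thresholds are midpoints between consecutive distinct values
    (an assumption made for this sketch).
    """
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    scored = [(t, weighted_split_entropy(values, labels, t)) for t in candidates]
    return min(scored, key=lambda pair: pair[1])

# Toy data, invented for illustration only (1 = not fully paid)
toy_fico = [640, 660, 680, 700, 720, 740, 760, 780]
toy_not_paid = [1, 1, 0, 1, 0, 0, 0, 0]

threshold, cost = best_threshold(toy_fico, toy_not_paid)
print(threshold, round(cost, 4))
```

A full tree would repeat this search over every feature at each node and then recurse on the two resulting subsets.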
import pandas as pd
loans_raw = pd.read_csv("https://raw.githubusercontent.com/mrbarron3/stat451/main/loan_data.csv")
# if the line above doesn't work for some reason, download the csv file to the same directory as this file
# and uncomment the next line
# loans_raw = pd.read_csv("loan_data.csv")
print(len(loans_raw))
9578
loans = loans_raw[["purpose", "int.rate", "installment", "log.annual.inc", "dti", "fico", "days.with.cr.line", "revol.bal", "delinq.2yrs", "not.fully.paid"]]
debt = loans[loans["purpose"] == "debt_consolidation"]
print(len(debt))
debt.head()
3957
| | purpose | int.rate | installment | log.annual.inc | dti | fico | days.with.cr.line | revol.bal | delinq.2yrs | not.fully.paid |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | debt_consolidation | 0.1189 | 829.10 | 11.350407 | 19.48 | 737 | 5639.958333 | 28854 | 0 | 0 |
| 2 | debt_consolidation | 0.1357 | 366.86 | 10.373491 | 11.63 | 682 | 4710.000000 | 3511 | 0 | 0 |
| 3 | debt_consolidation | 0.1008 | 162.34 | 11.350407 | 8.10 | 712 | 2699.958333 | 33667 | 0 | 0 |
| 6 | debt_consolidation | 0.1496 | 194.02 | 10.714418 | 4.00 | 667 | 3180.041667 | 3839 | 0 | 1 |
| 9 | debt_consolidation | 0.1221 | 84.12 | 10.203592 | 10.00 | 707 | 2730.041667 | 5630 | 0 | 0 |
debt.describe()
| | int.rate | installment | log.annual.inc | dti | fico | days.with.cr.line | revol.bal | delinq.2yrs | not.fully.paid |
|---|---|---|---|---|---|---|---|---|---|
| count | 3957.000000 | 3957.000000 | 3957.000000 | 3957.000000 | 3957.000000 | 3957.000000 | 3957.000000 | 3957.000000 | 3957.000000 |
| mean | 0.126595 | 358.984390 | 10.912909 | 14.076462 | 703.871367 | 4533.037139 | 17146.710639 | 0.163255 | 0.152388 |
| std | 0.024769 | 198.309002 | 0.547477 | 6.433460 | 34.397778 | 2340.567954 | 24167.207708 | 0.561788 | 0.359442 |
| min | 0.060000 | 23.210000 | 7.547502 | 0.000000 | 612.000000 | 180.041667 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.111400 | 201.520000 | 10.571317 | 9.200000 | 677.000000 | 2925.000000 | 5494.000000 | 0.000000 | 0.000000 |
| 50% | 0.128000 | 325.080000 | 10.903815 | 14.240000 | 697.000000 | 4114.041667 | 10868.000000 | 0.000000 | 0.000000 |
| 75% | 0.142600 | 491.300000 | 11.238436 | 19.130000 | 727.000000 | 5639.958333 | 19469.000000 | 0.000000 | 0.000000 |
| max | 0.212100 | 940.140000 | 14.528354 | 29.960000 | 822.000000 | 16259.041670 | 290341.000000 | 13.000000 | 1.000000 |