Group 24¶

Group members: Arthur Hu, Ge Li, Jiapeng Wang, Zhixing Liu, Zifu Wang

In [1]:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

Description of dataset¶

A bank has found that a growing number of customers are leaving its credit card service. In this project we predict which customers are likely to leave, so that the bank can act to retain them before they go.

The dataset covers about 10,000 customers and records attributes such as age, income category, marital status, credit limit, and card category. It has 21 columns: a client ID (CLIENTNUM), the target (Attrition_Flag), and 19 candidate features.

Here are the first few rows of the dataset:

In [2]:

df = pd.read_csv('BankChurners.csv')
df.head()

Out[2]:

	CLIENTNUM	Attrition_Flag	Customer_Age	Gender	Dependent_count	Education_Level	Marital_Status	Income_Category	Card_Category	Months_on_book	...	Months_Inactive_12_mon	Contacts_Count_12_mon	Credit_Limit	Total_Revolving_Bal	Avg_Open_To_Buy	Total_Amt_Chng_Q4_Q1	Total_Trans_Amt	Total_Trans_Ct	Total_Ct_Chng_Q4_Q1	Avg_Utilization_Ratio
0	768805383	Existing Customer	45	M	3	High School	Married	$60K - $80K	Blue	39	...	1	3	12691.0	777	11914.0	1.335	1144	42	1.625	0.061
1	818770008	Existing Customer	49	F	5	Graduate	Single	Less than $40K	Blue	44	...	1	2	8256.0	864	7392.0	1.541	1291	33	3.714	0.105
2	713982108	Existing Customer	51	M	3	Graduate	Married	$80K - $120K	Blue	36	...	1	0	3418.0	0	3418.0	2.594	1887	20	2.333	0.000
3	769911858	Existing Customer	40	F	4	High School	Unknown	Less than $40K	Blue	34	...	4	1	3313.0	2517	796.0	1.405	1171	20	2.333	0.760
4	709106358	Existing Customer	40	M	3	Uneducated	Married	$60K - $80K	Blue	21	...	1	0	4716.0	0	4716.0	2.175	816	28	2.500	0.000

5 rows × 21 columns
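
Before modelling, it is worth checking how balanced the target is, since a churn problem usually has far fewer leavers than stayers. The sketch below is a minimal example: `summarize_target` is a hypothetical helper, the `demo` frame is made up, and the label `Attrited Customer` is an assumption (only `Existing Customer` appears in the rows shown above).

```python
import pandas as pd


def summarize_target(df: pd.DataFrame, target: str = "Attrition_Flag") -> pd.DataFrame:
    """Return the count and share of each class in the target column."""
    counts = df[target].value_counts()
    shares = df[target].value_counts(normalize=True)
    return pd.DataFrame({"count": counts, "share": shares})


# Tiny made-up frame standing in for BankChurners.csv.
# "Attrited Customer" is an assumed label for customers who left.
demo = pd.DataFrame(
    {"Attrition_Flag": ["Existing Customer"] * 3 + ["Attrited Customer"]}
)

summary = summarize_target(demo)
print(summary)
```

On the real data the same call would be `summarize_target(df)`. A strongly skewed share would suggest reporting recall or F1 alongside accuracy.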

Questions¶

1. Build a model to predict which customers will leave.

2. Determine which features have the most significant impact on customer churn.

3. Compare the advantages and disadvantages of each feature-processing method and machine learning method on this dataset.

Variables¶

The full list of column names is shown below. The less self-explanatory ones are explained here first.

Total_Revolving_Bal: Total Revolving Balance on the Credit Card

Avg_Open_To_Buy: Open to Buy Credit Line (Average of last 12 months)

Total_Amt_Chng_Q4_Q1: Change in Transaction Amount (Q4 over Q1)

Total_Trans_Amt: Total Transaction Amount (Last 12 months)

Total_Trans_Ct: Total Transaction Count (Last 12 months)

Total_Ct_Chng_Q4_Q1: Change in Transaction Count (Q4 over Q1)

Avg_Utilization_Ratio: Average Card Utilization Ratio
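
Some of these columns appear to be derived from others. In the five rows printed by `df.head()`, `Avg_Open_To_Buy` equals `Credit_Limit - Total_Revolving_Bal`, and `Avg_Utilization_Ratio` equals `Total_Revolving_Bal / Credit_Limit` rounded to three decimals. The sketch below checks this on those five rows only. Whether it holds for the whole dataset is an assumption that would need to be verified on `df`.

```python
import numpy as np
import pandas as pd

# The five rows from df.head(), restricted to the four columns involved.
head = pd.DataFrame(
    {
        "Credit_Limit": [12691.0, 8256.0, 3418.0, 3313.0, 4716.0],
        "Total_Revolving_Bal": [777, 864, 0, 2517, 0],
        "Avg_Open_To_Buy": [11914.0, 7392.0, 3418.0, 796.0, 4716.0],
        "Avg_Utilization_Ratio": [0.061, 0.105, 0.000, 0.760, 0.000],
    }
)

# Candidate formulas for the two derived columns.
open_to_buy = head["Credit_Limit"] - head["Total_Revolving_Bal"]
utilization = head["Total_Revolving_Bal"] / head["Credit_Limit"]

matches_open_to_buy = np.allclose(open_to_buy, head["Avg_Open_To_Buy"])
# The stored ratio is rounded to 3 decimals, so allow half a unit in the last place.
matches_utilization = np.allclose(
    utilization, head["Avg_Utilization_Ratio"], rtol=0, atol=5e-4
)

print(matches_open_to_buy, matches_utilization)
```

If the relationship also holds on the full data, these columns carry redundant information, which matters for feature selection and for linear models such as logistic regression.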

In [3]:

df.columns

Out[3]:

Index(['CLIENTNUM', 'Attrition_Flag', 'Customer_Age', 'Gender',
       'Dependent_count', 'Education_Level', 'Marital_Status',
       'Income_Category', 'Card_Category', 'Months_on_book',
       'Total_Relationship_Count', 'Months_Inactive_12_mon',
       'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
       'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
       'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio'],
      dtype='object')

Methods¶

For data processing, we will try to apply as many of the feature engineering methods covered in class as possible: one-hot encoding, binning, rescaling, data imputation (if needed), feature selection, and feature importance.
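
As a rough illustration of three of these steps (one-hot encoding, binning, and rescaling), here is a minimal pandas sketch on a few values taken from the rows shown earlier. The bin edges, bin labels, and the new column names `Age_Bin` and `Credit_Limit_Scaled` are illustrative assumptions, not choices made by the project.

```python
import pandas as pd

# Three columns from the first five rows of the dataset.
demo = pd.DataFrame(
    {
        "Customer_Age": [45, 49, 51, 40, 40],
        "Gender": ["M", "F", "M", "F", "M"],
        "Credit_Limit": [12691.0, 8256.0, 3418.0, 3313.0, 4716.0],
    }
)

# One-hot encoding: replace the categorical column with indicator columns.
encoded = pd.get_dummies(demo, columns=["Gender"])

# Binning: group ages into right-closed intervals (0, 40], (40, 50], (50, 120].
# The edges and labels are assumptions chosen only for illustration.
encoded["Age_Bin"] = pd.cut(
    demo["Customer_Age"],
    bins=[0, 40, 50, 120],
    labels=["<=40", "41-50", ">50"],
)

# Rescaling: min-max scale the credit limit to the range [0, 1].
limit = demo["Credit_Limit"]
encoded["Credit_Limit_Scaled"] = (limit - limit.min()) / (limit.max() - limit.min())

print(encoded)
```

In a real experiment the scaling parameters (minimum and maximum) should be computed on the training split only and then applied to the test split, to avoid leaking test information.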

For prediction, we will try the classification methods covered in class: logistic regression, decision trees, SVM, kNN (if useful), and several ensemble methods (such as random forest).
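
One possible way to compare several classifiers under identical preprocessing is a scikit-learn `Pipeline`. The sketch below is not the project's final setup: it runs on synthetic data with a made-up churn rule, uses only three column names borrowed from the dataset, and the hyperparameters are arbitrary assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the real data (values are random, not from the CSV).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame(
    {
        "Total_Trans_Ct": rng.integers(10, 120, size=n),
        "Total_Revolving_Bal": rng.integers(0, 2500, size=n),
        "Gender": rng.choice(["M", "F"], size=n),
    }
)
# Made-up rule: customers with few transactions are labelled as churned (1).
y = (X["Total_Trans_Ct"] < 45).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale numeric columns and one-hot encode the categorical one.
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["Total_Trans_Ct", "Total_Revolving_Bal"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Gender"]),
    ]
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model behind the same preprocessing and record test accuracy.
scores = {}
for name, model in models.items():
    clf = Pipeline([("preprocess", preprocess), ("model", model)])
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)

print(scores)
```

Keeping the preprocessing inside the pipeline means it is fitted on the training split only, so every model is compared under the same conditions. Other classifiers (decision tree, SVM, kNN) could be added to the `models` dictionary in the same way.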
