STAT 451 Project Proposal - Group 7
Description of the Dataset
The Credit Card Transactions Dataset provides detailed records of credit card transactions, including information about transaction times, amounts, and associated personal and merchant details. This dataset has over 1.85M rows.
Code
In [5]:
import pandas as pd
df = pd.read_csv('credit_card_transactions.csv')
df.head(10)
Out[5]:
| | Unnamed: 0 | trans_date_trans_time | cc_num | merchant | category | amt | first | last | gender | street | ... | long | city_pop | job | dob | trans_num | unix_time | merch_lat | merch_long | is_fraud | merch_zipcode |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2019-01-01 00:00:18 | 2703186189652095 | fraud_Rippin, Kub and Mann | misc_net | 4.97 | Jennifer | Banks | F | 561 Perry Cove | ... | -81.1781 | 3495 | Psychologist, counselling | 1988-03-09 | 0b242abb623afc578575680df30655b9 | 1325376018 | 36.011293 | -82.048315 | 0 | 28705.0 |
1 | 1 | 2019-01-01 00:00:44 | 630423337322 | fraud_Heller, Gutmann and Zieme | grocery_pos | 107.23 | Stephanie | Gill | F | 43039 Riley Greens Suite 393 | ... | -118.2105 | 149 | Special educational needs teacher | 1978-06-21 | 1f76529f8574734946361c461b024d99 | 1325376044 | 49.159047 | -118.186462 | 0 | NaN |
2 | 2 | 2019-01-01 00:00:51 | 38859492057661 | fraud_Lind-Buckridge | entertainment | 220.11 | Edward | Sanchez | M | 594 White Dale Suite 530 | ... | -112.2620 | 4154 | Nature conservation officer | 1962-01-19 | a1a22d70485983eac12b5b88dad1cf95 | 1325376051 | 43.150704 | -112.154481 | 0 | 83236.0 |
3 | 3 | 2019-01-01 00:01:16 | 3534093764340240 | fraud_Kutch, Hermiston and Farrell | gas_transport | 45.00 | Jeremy | White | M | 9443 Cynthia Court Apt. 038 | ... | -112.1138 | 1939 | Patent attorney | 1967-01-12 | 6b849c168bdad6f867558c3793159a81 | 1325376076 | 47.034331 | -112.561071 | 0 | NaN |
4 | 4 | 2019-01-01 00:03:06 | 375534208663984 | fraud_Keeling-Crist | misc_pos | 41.96 | Tyler | Garcia | M | 408 Bradley Rest | ... | -79.4629 | 99 | Dance movement psychotherapist | 1986-03-28 | a41d7549acf90789359a9aa5346dcb46 | 1325376186 | 38.674999 | -78.632459 | 0 | 22844.0 |
5 | 5 | 2019-01-01 00:04:08 | 4767265376804500 | fraud_Stroman, Hudson and Erdman | gas_transport | 94.63 | Jennifer | Conner | F | 4655 David Island | ... | -75.2045 | 2158 | Transport planner | 1961-06-19 | 189a841a0a8ba03058526bcfe566aab5 | 1325376248 | 40.653382 | -76.152667 | 0 | 17972.0 |
6 | 6 | 2019-01-01 00:04:42 | 30074693890476 | fraud_Rowe-Vandervort | grocery_net | 44.54 | Kelsey | Richards | F | 889 Sarah Station Suite 624 | ... | -100.9893 | 2691 | Arboriculturist | 1993-08-16 | 83ec1cc84142af6e2acf10c44949e720 | 1325376282 | 37.162705 | -100.153370 | 0 | NaN |
7 | 7 | 2019-01-01 00:05:08 | 6011360759745864 | fraud_Corwin-Collins | gas_transport | 71.65 | Steven | Williams | M | 231 Flores Pass Suite 720 | ... | -78.6003 | 6018 | Designer, multimedia | 1947-08-21 | 6d294ed2cc447d2c71c7171a3d54967c | 1325376308 | 38.948089 | -78.540296 | 0 | 22644.0 |
8 | 8 | 2019-01-01 00:05:18 | 4922710831011201 | fraud_Herzog Ltd | misc_pos | 4.27 | Heather | Chase | F | 6888 Hicks Stream Suite 954 | ... | -79.6607 | 1472 | Public affairs consultant | 1941-03-07 | fc28024ce480f8ef21a32d64c93a29f5 | 1325376318 | 40.351813 | -79.958146 | 0 | 15236.0 |
9 | 9 | 2019-01-01 00:06:01 | 2720830304681674 | fraud_Schoen, Kuphal and Nitzsche | grocery_pos | 198.39 | Melissa | Aguilar | F | 21326 Taylor Squares Suite 708 | ... | -87.3490 | 151785 | Pathologist | 1974-03-28 | 3b9014ea8fb80bd65de0b1463b00b00e | 1325376361 | 37.179198 | -87.485381 | 0 | 42442.0 |
10 rows × 24 columns
Questions
We are most interested in how accurately a machine learning model can predict fraudulent transactions in credit card transaction data, and in which features contribute most to model performance.
- Which future transactions are likely to be fraudulent?
- Are there specific times of day or days of the week with higher fraud incidence?
- Are fraudulent transactions more likely to involve specific merchants or cardholders (category, age, occupation, etc.)?
- Is there a typical range of transaction amounts that is associated with fraud?
- Is there a relationship between merchant location, cardholder location, and transaction location in fraudulent transactions?
- Which variable has the most impact on a transaction being fraudulent?
- What machine learning models are most effective in detecting fraud?
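The time-of-day and day-of-week question could be explored with a group-by on the parsed timestamp. The sketch below is only an illustration: it runs on a small made-up frame that mimics the `trans_date_trans_time` and `is_fraud` columns, and the helper name `fraud_rate_by` is a hypothetical one introduced here. The toy values are not results from the real data.

```python
import pandas as pd

# Toy rows that mimic two columns of the real dataset (values are made up).
toy = pd.DataFrame({
    "trans_date_trans_time": [
        "2019-01-01 00:00:18",  # Tuesday, hour 0
        "2019-01-01 00:05:08",  # Tuesday, hour 0
        "2019-01-01 13:30:00",  # Tuesday, hour 13
        "2019-01-05 23:10:00",  # Saturday, hour 23
        "2019-01-05 23:45:00",  # Saturday, hour 23
    ],
    "is_fraud": [0, 1, 0, 1, 1],
})


def fraud_rate_by(df: pd.DataFrame, part: str) -> pd.Series:
    """Return the share of fraudulent transactions per hour or per weekday."""
    ts = pd.to_datetime(df["trans_date_trans_time"])
    if part == "hour":
        key = ts.dt.hour
    elif part == "day_of_week":
        key = ts.dt.day_name()
    else:
        raise ValueError(f"unsupported part: {part}")
    # is_fraud is 0/1, so its mean within each group is the fraud rate.
    return df["is_fraud"].groupby(key).mean()


hourly = fraud_rate_by(toy, "hour")
daily = fraud_rate_by(toy, "day_of_week")
```

On the full dataset the same function could be called on `df` directly, although with very few fraud cases per group the rates may be noisy.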
Variables
- trans_date_trans_time: Timestamp of the transaction
- cc_num: Credit card number (hashed or anonymized)
- merchant: Merchant or store where the transaction occurred
- category: Type of transaction (e.g., grocery, entertainment)
- amt: Amount of the transaction
- gender: Gender of the cardholder
- job: Occupation of the cardholder
- dob: Date of birth of the cardholder
- street, city, state, zip: Address details of the cardholder
- lat, long: Geographical coordinates of the transaction
- city_pop: Population of the city where the transaction occurred
- merch_lat, merch_long, merch_zipcode: Geographical coordinates and ZIP code of the merchant
- is_fraud: Indicator of whether the transaction is fraudulent
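The two coordinate pairs (`lat`, `long`) and (`merch_lat`, `merch_long`) could be combined into a single distance feature for the location question. A minimal sketch using the haversine formula follows; the function name `haversine_km` is hypothetical, and the Earth radius of 6371 km is the usual mean-radius approximation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of longitude along the equator is about 111.19 km.
one_degree = haversine_km(0.0, 0.0, 0.0, 1.0)
```

Applied row by row to the dataset, this would give one candidate feature, the distance between the `lat`/`long` point and the merchant, that a model could use.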
Methods
- Logistic Regression: An effective model for binary classification problems like fraud detection. It is highly interpretable and can be used to understand which features are most relevant to fraud.
- Decision Tree: Allows for easy visualization of decision-making paths. For example, we can see if the probability of fraud increases significantly when the transaction amount exceeds a certain value.
- k-NN classification: Another method we can use for fraud detection. It labels a transaction according to the majority class among its k most similar transactions.
- Train-Validation-Test Split and Regularization: A held-out split lets us tune and evaluate models on data they were not trained on, and regularization helps reduce overfitting.
- Gradient Boosting: We can also consider applying gradient boosting, if we cover it in class, because it can handle large datasets effectively.
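As a rough illustration of how these methods might fit together, the sketch below trains a regularized logistic regression and a depth-limited decision tree on a stratified train/validation/test split with scikit-learn. The data is synthetic (`make_classification`), and the 5% positive rate, the 60/20/20 split, and the hyperparameters `C=1.0` and `max_depth=4` are all assumptions chosen for the example, not values taken from the project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features: about 5% positives (assumed ratio).
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=4,
    weights=[0.95, 0.05], random_state=0,
)

# 60/20/20 split, stratified so each part keeps roughly the same class ratio.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=0
)

models = {
    # C is the inverse L2 regularization strength (smaller C = stronger penalty).
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(C=1.0, max_iter=1000)
    ),
    # Limiting the depth is a simple way to keep a tree from overfitting.
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
}

val_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Plain accuracy on the validation set; the test set stays untouched
    # until a final model has been chosen.
    val_scores[name] = model.score(X_val, y_val)
```

With heavily imbalanced classes, accuracy alone can look high even for a model that never flags fraud, so a metric such as recall on the fraud class may be worth reporting as well.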