Group 15
Jing Zhang
Vasu Janjrukia
Yitong Fu
Data
The dataset comes from the UCI Machine Learning Repository. It contains records of students enrolled in different undergraduate degrees (e.g., design, social services, technologies) at a higher education institution. Below is our code to read the data.
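import pandas as pd

# The UCI download is a zip archive; pandas reads it directly provided it holds a single CSV file.
link = "https://archive.ics.uci.edu/static/public/697/predict+students+dropout+and+academic+success.zip"
dataset = pd.read_csv(link, sep=";", compression="zip")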
Descriptions of Question(s)
School dropout in higher education is an obstacle to economic growth, employment, and productivity, directly impacting the lives of students and their families, higher education institutions, and society as a whole (Martins et al., 2021). Identifying at-risk students makes early intervention possible, which in turn can reduce dropout rates. Thus, our project aims to predict whether a student will drop out of their undergraduate program or will remain enrolled or graduate. We ask the following questions:
Q1: Can we accurately predict whether a student will drop out based on available data?
Q2: What are the most important factors influencing student retention and success?
Descriptions of Variable(s)
The dataset consists of 4,424 records and 37 variables collected from students enrolled in various undergraduate programs. It includes:
Method
We will first prepare the data by removing records with missing values and converting “Target” (the outcome variable) into a binary variable, where dropout = 0 and enrolled/graduate = 1. To identify the factors most predictive of student dropout, we will make correlation plots and combine those insights with the literature to inform variable selection. We will primarily use logistic regression because it is well suited to binary classification tasks. Additionally, we will compare its performance with that of other models, such as Decision Trees. Finally, we will evaluate the models, interpret the predictors, and discuss the implications.
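The steps in this section could be sketched roughly as follows. This is a minimal illustration rather than our final pipeline. It assumes scikit-learn is available and that the outcome labels in “Target” are the strings "Dropout", "Enrolled", and "Graduate". The helper names (prepare, correlations_with_target, compare_models), the 75/25 split, and the tree depth are hypothetical choices. The toy data frame at the bottom is randomly generated so that the sketch runs on its own; it is not the UCI dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def prepare(df, target_col="Target"):
    """Drop rows with missing values and binarize the outcome:
    "Dropout" -> 0, anything else (enrolled/graduate) -> 1."""
    df = df.dropna().copy()
    df[target_col] = (df[target_col] != "Dropout").astype(int)
    return df


def correlations_with_target(df, target_col="Target"):
    """Correlation of each numeric predictor with the binary target,
    ordered from strongest to weakest absolute correlation."""
    corr = df.select_dtypes("number").corr()[target_col].drop(target_col)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


def compare_models(df, target_col="Target", seed=0):
    """Fit logistic regression and a decision tree on the same split
    and return the held-out accuracy of each."""
    X = df.drop(columns=[target_col]).select_dtypes("number")
    y = df[target_col]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    models = {
        # Scaling helps the logistic regression solver converge.
        "logistic_regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
        "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores


# Toy data (randomly generated, NOT the real dataset): a made-up "grade"
# column drives dropout, and a made-up "age" column is pure noise.
rng = np.random.default_rng(0)
n = 400
grade = rng.normal(12, 3, n)
age = rng.integers(18, 40, n).astype(float)
labels = np.where(
    grade + rng.normal(0, 2, n) < 10,
    "Dropout",
    np.where(rng.random(n) < 0.5, "Enrolled", "Graduate"),
)
toy = pd.DataFrame({"grade": grade, "age": age, "Target": labels})
toy.loc[0, "grade"] = np.nan  # one missing value to exercise dropna()

clean = prepare(toy)
ranked = correlations_with_target(clean)
scores = compare_models(clean)
print(ranked)
print(scores)
```

On the real data, the same functions could be applied to the dataset loaded in the Data section, for example compare_models(prepare(dataset)). Accuracy is used here only for brevity; because dropouts are likely the minority class, metrics such as recall or F1 for the dropout class may be more informative.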