Group 15

Jing Zhang

Vasu Janjrukia

Yitong Fu

Data

The dataset is from the UCI Machine Learning Repository. It contains records of students enrolled in different undergraduate degrees (e.g., design, social services, technologies) at a higher education institution. Below is our code to read the data.

In [1]:
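import pandas as pd

# UCI archive URL; pandas infers zip compression from the ".zip" suffix.
link = "https://archive.ics.uci.edu/static/public/697/predict+students+dropout+and+academic+success.zip"
# Fields are separated by semicolons rather than commas.
dataset = pd.read_csv(link, sep=";")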

Descriptions of Question(s)

School dropout in higher education is an obstacle to economic growth, employment, and productivity, directly impacting the lives of students and their families, higher education institutions, and society as a whole (Martins et al., 2021). Identifying at-risk students makes early intervention possible, which can in turn reduce dropout rates. Thus, our project aims to predict whether a student will drop out or will remain enrolled and graduate from their undergraduate program. We ask the following questions:

Q1: Can we accurately predict whether a student will drop out based on available data?

Q2: What are the most important factors influencing student retention and success?

Descriptions of Variable(s)

The dataset consists of 4,424 records and 37 variables collected from students enrolled in various undergraduate programs. It includes:

  • Demographic Information: Age at enrollment, nationality, gender, marital status.
  • Academic Background: Admission grade, previous qualification and its grade, academic performance during the first and second semesters.
  • Socioeconomic Factors: Parents’ qualifications and occupations, unemployment rate, inflation rate, GDP.
  • Financial Status: Scholarship holder status, tuition fees up-to-date, debtor status.
  • Enrollment and Attendance: Application mode and order, course type, attendance type (daytime/evening).
  • Performance Indicators: Number of enrolled, evaluated, approved, and credited curricular units across both semesters.
  • Outcome Variable: “Target”, which classifies students into three categories based on their status at the end of the normal duration of the degree: Dropout, Enrolled, Graduate.
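The structure described above can be checked directly once the data is loaded. The snippet below is a minimal sketch, not part of our pipeline: `summarize_dataset` is a hypothetical helper, and the small `toy` frame is made-up data used only so the example runs without downloading anything. The only name taken from the dataset description is the “Target” column.

```python
import pandas as pd


def summarize_dataset(df: pd.DataFrame, target_col: str = "Target") -> dict:
    """Return row/column counts, total missing cells, and class frequencies."""
    return {
        "n_rows": df.shape[0],
        "n_columns": df.shape[1],
        "n_missing": int(df.isna().sum().sum()),
        "class_counts": df[target_col].value_counts().to_dict(),
    }


# Made-up rows for illustration only; these are not real records.
toy = pd.DataFrame(
    {
        "Admission grade": [120.0, 135.5, 110.0, 142.3],
        "Target": ["Dropout", "Graduate", "Enrolled", "Graduate"],
    }
)

summary = summarize_dataset(toy)
print(summary)
```

Calling `summarize_dataset(dataset)` on the loaded data should report 4,424 rows and 37 columns if the description above holds.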

Method

We will first prepare the data by removing records with missing values and converting “Target” (the outcome variable) into a binary variable, where dropout = 0 and enrolled/graduate = 1. To identify the factors most predictive of student dropout, we will produce correlation plots and combine these insights with the literature to inform variable selection. We will primarily use logistic regression because it is well suited to binary classification tasks. Additionally, we will compare its performance with that of other models such as decision trees. Finally, we will evaluate the model, interpret the predictors, and discuss the implications.
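The steps in this plan can be sketched as follows. This is a rough outline under several assumptions rather than our final implementation: it uses scikit-learn, a 75/25 stratified split, test-set accuracy as the comparison metric, standardized inputs for logistic regression, and a depth-limited tree, none of which are fixed by the plan above. The helper names (`prepare`, `rank_by_correlation`, `compare_models`) and the synthetic `toy` data, including its predictor column names, are hypothetical and exist only to make the example self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def prepare(df, target_col="Target"):
    """Drop rows with missing values and binarize the outcome (Dropout = 0, otherwise 1)."""
    clean = df.dropna().copy()
    clean[target_col] = (clean[target_col] != "Dropout").astype(int)
    return clean


def rank_by_correlation(df, target_col="Target"):
    """Absolute Pearson correlation of each numeric predictor with the binary target."""
    corr = df.select_dtypes("number").corr()[target_col].drop(target_col)
    return corr.abs().sort_values(ascending=False)


def compare_models(df, target_col="Target", seed=0):
    """Fit both models on the same split and return test-set accuracy for each."""
    X = df.drop(columns=target_col)
    y = df[target_col]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    models = {
        # Scaling is an assumption; it helps the solver converge.
        "logistic_regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
        # max_depth=4 is an arbitrary illustrative choice.
        "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=seed),
    }
    return {
        name: model.fit(X_train, y_train).score(X_test, y_test)
        for name, model in models.items()
    }


# Synthetic data for illustration only; these are not real student records.
rng = np.random.default_rng(0)
n = 200
units = rng.integers(0, 7, n)
status = np.where(
    units + rng.normal(0, 1, n) < 2,
    "Dropout",
    np.where(rng.random(n) < 0.3, "Enrolled", "Graduate"),
)
toy = pd.DataFrame(
    {
        "Admission grade": rng.normal(130, 15, n),
        "Curricular units approved": units,
        "Target": status,
    }
)
toy.loc[0, "Admission grade"] = np.nan  # one missing value to exercise dropna

clean = prepare(toy)
ranking = rank_by_correlation(clean)
scores = compare_models(clean)
print(ranking)
print(scores)
```

On the real data, `ranking` would feed the correlation plots and variable selection, and the fitted logistic regression coefficients would be the starting point for interpreting the predictors.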