To load the data, we can use the following code, taken directly from the UCI website:
%%capture
pip install ucimlrepo
from ucimlrepo import fetch_ucirepo
# fetch dataset
student_performance = fetch_ucirepo(id=320)
# data (as pandas dataframes)
X = student_performance.data.features
y = student_performance.data.targets
X.head()
|   | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | higher | internet | romantic | famrel | freetime | goout | Dalc | Walc | health | absences |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | yes | no | no | 4 | 3 | 4 | 1 | 1 | 3 | 4 |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | ... | yes | yes | no | 5 | 3 | 3 | 1 | 1 | 3 | 2 |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | ... | yes | yes | no | 4 | 3 | 2 | 2 | 3 | 3 | 6 |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | ... | yes | yes | yes | 3 | 2 | 2 | 1 | 1 | 5 | 0 |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | ... | yes | no | no | 4 | 3 | 2 | 1 | 2 | 5 | 0 |
5 rows × 30 columns
y.head()
|   | G1 | G2 | G3 |
|---|---|---|---|
| 0 | 0 | 11 | 11 |
| 1 | 9 | 11 | 11 |
| 2 | 12 | 13 | 12 |
| 3 | 14 | 14 | 14 |
| 4 | 11 | 13 | 13 |
Below are the variable names, each with a description:
"school": student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira),
"sex": student's sex (binary: 'F' - female or 'M' - male),
"age": student's age (numeric: from 15 to 22),
"address": student's home address type (binary: 'U' - urban or 'R' - rural),
"famsize": family size (binary: 'LE3' - less than or equal to 3 or 'GT3' - greater than 3),
"Pstatus": parent's cohabitation status (binary: 'T' - living together or 'A' - apart),
"Medu": mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education),
"Fedu": father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education),
"Mjob": mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other'),
"Fjob": father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other'),
"reason": reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other'),
"guardian": student's guardian (nominal: 'mother', 'father' or 'other'),
"traveltime": home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour),
"studytime": weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours),
"failures": number of past class failures (numeric: n if 1<=n<3, else 4),
"schoolsup": extra educational support (binary: yes or no),
"famsup": family educational support (binary: yes or no),
"paid": extra paid classes within the course subject (Math or Portuguese) (binary: yes or no),
"activities": extra-curricular activities (binary: yes or no),
"nursery": attended nursery school (binary: yes or no),
"higher": wants to take higher education (binary: yes or no),
"internet": Internet access at home (binary: yes or no),
"romantic": in a romantic relationship (binary: yes or no),
"famrel": quality of family relationships (numeric: from 1 - very bad to 5 - excellent),
"freetime": free time after school (numeric: from 1 - very low to 5 - very high),
"goout": going out with friends (numeric: from 1 - very low to 5 - very high),
"Dalc": workday alcohol consumption (numeric: from 1 - very low to 5 - very high),
"Walc": weekend alcohol consumption (numeric: from 1 - very low to 5 - very high),
"health": current health status (numeric: from 1 - very bad to 5 - very good),
"absences": number of school absences (numeric: from 0 to 93),
"G1": first period grade (numeric: from 0 to 20),
"G2": second period grade (numeric: from 0 to 20),
"G3": final grade (numeric: from 0 to 20, output target)
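
Since the variable list mixes text-coded columns (binary and nominal) with integer-coded ones, it can help to separate the two groups before preprocessing. The following is a minimal sketch rather than part of the original analysis: `sample` is a hand-built frame holding the first three rows of a few columns copied from the `X.head()` output, and the names `sample`, `numeric_cols`, and `categorical_cols` are assumptions introduced for illustration.

```python
import pandas as pd

# First three rows of a few columns, copied from the X.head() output.
# With the real data, `sample` would simply be replaced by X.
sample = pd.DataFrame({
    "school": ["GP", "GP", "GP"],
    "sex": ["F", "F", "F"],
    "age": [18, 17, 15],
    "famsize": ["GT3", "GT3", "LE3"],
    "Medu": [4, 1, 1],
    "Mjob": ["at_home", "at_home", "at_home"],
})

# Integer-coded columns (counts and ordinal scales such as Medu).
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# Text-coded columns (binary and nominal), candidates for one-hot encoding.
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

Note that the dtype alone cannot tell an ordinal scale such as "Medu" from a true count such as "absences"; that distinction still has to come from the descriptions above.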
Questions
- Which factors influence students' exam grades (G1, G2, G3) in the Mathematics and Portuguese courses?
- Which familial, cultural, and environmental factors influence exam performance in the Mathematics and Portuguese courses?
Methods
- Preprocessing:
  - Data cleaning: rescaling and imputation (if necessary)
  - One-hot encoding (OHE) will probably also be needed for the binary and nominal categorical variables
- Model selection:
  - Decision Trees
  - Ridge Regression
  - LASSO Regression
  - These models help identify influential predictors and provide insight into how each variable affects the target variable
- Cross-Validation and Hyperparameter Tuning:
  - Evaluate model performance and guard against overfitting so that results generalize to new data
  - This will involve splitting the data into training, validation, and test sets to evaluate each model fully
- Model Comparison:
  - Compare models based on performance metrics (e.g., MSE for regression or accuracy for classification) to select the best model
  - We may also examine the confusion matrix (if classification is used)
- Visualizations:
  - Feature importance plots for interpretability
  - ROC curves (if classification is used)
  - Regression line plots
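
One way the regression part of this plan could fit together in scikit-learn is sketched below. It is an illustration under stated assumptions, not the final implementation: the data are synthetic stand-ins (`X_toy` and `y_toy` are made up, with column names and value ranges borrowed from the variable list), the hyperparameter grids are arbitrary, and 5-fold cross-validation on the training split plays the role of the validation set.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the real dataset): a few columns with the
# same names and value ranges as the variable list, plus a made-up grade.
rng = np.random.default_rng(0)
n = 200
X_toy = pd.DataFrame({
    "sex": rng.choice(["F", "M"], size=n),
    "Mjob": rng.choice(["teacher", "health", "services", "at_home", "other"], size=n),
    "studytime": rng.integers(1, 5, size=n),   # 1 to 4
    "absences": rng.integers(0, 30, size=n),   # 0 to 29
})
noise = rng.normal(0, 2, size=n)
y_toy = np.clip(8 + 2 * X_toy["studytime"] - 0.1 * X_toy["absences"] + noise, 0, 20)

categorical_cols = ["sex", "Mjob"]
numeric_cols = ["studytime", "absences"]

# Preprocessing: one-hot encode categorical columns, rescale numeric ones.
preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ("scale", StandardScaler(), numeric_cols),
])

# Candidate models with small, arbitrary hyperparameter grids.
candidates = {
    "ridge": (Ridge(), {"model__alpha": [0.1, 1.0, 10.0]}),
    "lasso": (Lasso(max_iter=10_000), {"model__alpha": [0.01, 0.1, 1.0]}),
    "tree": (DecisionTreeRegressor(random_state=0), {"model__max_depth": [2, 4, 8]}),
}

# Hold out a test set; cross-validation splits the rest into train/validation folds.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

test_mse = {}
for name, (model, grid) in candidates.items():
    pipe = Pipeline([("preprocess", preprocess), ("model", model)])
    search = GridSearchCV(pipe, grid, cv=5, scoring="neg_mean_squared_error")
    search.fit(X_train, y_train)
    test_mse[name] = mean_squared_error(y_test, search.predict(X_test))
    print(f"{name}: best params {search.best_params_}, test MSE {test_mse[name]:.2f}")
```

Keeping the preprocessing inside the `Pipeline` means the encoder and scaler are refit on each training fold, so no information from the validation folds or the test set leaks into model fitting. With the real data, `X_toy` and `y_toy` would be replaced by `X` and a single target column such as `y["G3"]`.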