We will analyze airline on-time performance data from the U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics. We will use this data to build models that predict the likelihood and duration of delays, and to group airports and carriers with similar delay patterns.
Our data source is the Bureau of Transportation Statistics' report on carrier on-time performance. We will use data from August 2023, as this is the latest month available and the large number of entries per month (>600,000) limits the number of months we can analyze.
The data includes the departure date and scheduled departure time, airline carrier, origin and destination airports, whether a flight was delayed 15+ minutes, total delay time, cancellation status, and the sources of delay. We will use delay time and cancellation status as targets for classification and regression, and the per-airport and per-carrier averages of these variables for grouping.
import pandas as pd
import numpy as np
df_aug = pd.read_csv('./datasets/aug.csv')
print(f'Dataframe columns: {list(df_aug.columns)}')
print(f'\nDataframe shape: {df_aug.shape}')
df_aug.head()
Dataframe columns: ['YEAR', 'MONTH', 'DAY_OF_MONTH', 'DAY_OF_WEEK', 'FL_DATE', 'OP_UNIQUE_CARRIER', 'OP_CARRIER_FL_NUM', 'ORIGIN', 'ORIGIN_CITY_NAME', 'ORIGIN_STATE_NM', 'DEST', 'DEST_STATE_ABR', 'DEST_STATE_NM', 'CRS_DEP_TIME', 'DEP_DELAY', 'DEP_DELAY_NEW', 'DEP_DEL15', 'CANCELLED', 'CARRIER_DELAY', 'WEATHER_DELAY', 'NAS_DELAY', 'SECURITY_DELAY', 'LATE_AIRCRAFT_DELAY']

Dataframe shape: (602987, 23)
| | YEAR | MONTH | DAY_OF_MONTH | DAY_OF_WEEK | FL_DATE | OP_UNIQUE_CARRIER | OP_CARRIER_FL_NUM | ORIGIN | ORIGIN_CITY_NAME | ORIGIN_STATE_NM | ... | CRS_DEP_TIME | DEP_DELAY | DEP_DELAY_NEW | DEP_DEL15 | CANCELLED | CARRIER_DELAY | WEATHER_DELAY | NAS_DELAY | SECURITY_DELAY | LATE_AIRCRAFT_DELAY |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2023 | 8 | 1 | 2 | 8/1/2023 12:00:00 AM | 9E | 4900 | RIC | Richmond, VA | Virginia | ... | 619 | -4.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN |
1 | 2023 | 8 | 1 | 2 | 8/1/2023 12:00:00 AM | 9E | 4901 | CLT | Charlotte, NC | North Carolina | ... | 1955 | -4.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN |
2 | 2023 | 8 | 1 | 2 | 8/1/2023 12:00:00 AM | 9E | 4901 | JFK | New York, NY | New York | ... | 1629 | -7.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN |
3 | 2023 | 8 | 1 | 2 | 8/1/2023 12:00:00 AM | 9E | 4902 | SYR | Syracuse, NY | New York | ... | 615 | -11.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN |
4 | 2023 | 8 | 1 | 2 | 8/1/2023 12:00:00 AM | 9E | 4903 | BHM | Birmingham, AL | Alabama | ... | 1108 | -5.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN |
5 rows × 23 columns
df_aug.describe()
| | YEAR | MONTH | DAY_OF_MONTH | DAY_OF_WEEK | OP_CARRIER_FL_NUM | CRS_DEP_TIME | DEP_DELAY | DEP_DELAY_NEW | DEP_DEL15 | CANCELLED | CARRIER_DELAY | WEATHER_DELAY | NAS_DELAY | SECURITY_DELAY | LATE_AIRCRAFT_DELAY |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 602987.0 | 602987.0 | 602987.000000 | 602987.000000 | 602987.000000 | 602987.000000 | 594101.000000 | 594101.000000 | 594101.000000 | 602987.000000 | 128439.00000 | 128439.000000 | 128439.000000 | 128439.000000 | 128439.000000 |
mean | 2023.0 | 8.0 | 15.934484 | 3.884984 | 2338.640302 | 1337.394720 | 14.245425 | 17.300358 | 0.221735 | 0.015211 | 26.88382 | 3.736529 | 11.504185 | 0.165954 | 30.897593 |
std | 0.0 | 0.0 | 8.947471 | 1.938300 | 1580.512306 | 501.902422 | 60.608499 | 59.618909 | 0.415414 | 0.122391 | 84.94507 | 28.404783 | 29.912260 | 3.606032 | 68.645197 |
min | 2023.0 | 8.0 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | -59.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 2023.0 | 8.0 | 8.000000 | 2.000000 | 1048.000000 | 910.000000 | -5.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 2023.0 | 8.0 | 16.000000 | 4.000000 | 2094.000000 | 1326.000000 | -2.000000 | 0.000000 | 0.000000 | 0.000000 | 4.00000 | 0.000000 | 0.000000 | 0.000000 | 6.000000 |
75% | 2023.0 | 8.0 | 24.000000 | 5.000000 | 3410.000000 | 1749.000000 | 11.000000 | 11.000000 | 0.000000 | 0.000000 | 23.00000 | 0.000000 | 14.000000 | 0.000000 | 35.000000 |
max | 2023.0 | 8.0 | 31.000000 | 7.000000 | 8810.000000 | 2359.000000 | 3445.000000 | 3445.000000 | 1.000000 | 1.000000 | 3424.00000 | 1561.000000 | 1316.000000 | 805.000000 | 1865.000000 |
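The count row of the summary above shows that DEP_DELAY has 594,101 non-null values and the five delay-cause columns have 128,439 each, against 602,987 rows in total, so missing values will need handling before modeling. Below is a minimal sketch of how we might quantify this. It runs on a small synthetic frame: the values are made up and only the column names come from our data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_aug (made-up values, real column names).
df = pd.DataFrame({
    "DEP_DELAY": [-4.0, 30.0, np.nan, 2.0],
    "CANCELLED": [0.0, 0.0, 1.0, 0.0],
    "CARRIER_DELAY": [np.nan, 30.0, np.nan, np.nan],
})

# Number and share of missing values per column.
missing = df.isna().sum()
missing_share = df.isna().mean()

# Share of cancelled flights that have no recorded departure delay.
cancelled_missing = df.loc[df["CANCELLED"] == 1, "DEP_DELAY"].isna().mean()

print(missing)
print(missing_share.round(2))
print(f"Cancelled flights missing DEP_DELAY: {cancelled_missing:.0%}")
```

Running the same checks on df_aug would show whether the gaps in DEP_DELAY line up with cancellations, which we have not yet verified.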
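The questions we are looking to answer follow from the goals above: will a flight be delayed or cancelled (classification), how long will a delay last (regression), and which airports and carriers share similar delay patterns (grouping)?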
To answer these questions, we will use several machine learning tools and techniques learned in the course. For classifying delays and cancellations, we will compare decision trees, SVMs, and logistic regression, each combined with ensemble learning, and select the best performer. For regression, we will test linear regression and kNN regression. We will choose among these methods using cross-validation (CV) and hyperparameter tuning. For grouping, we will use K-means to cluster airports and carriers.
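The selection procedure described above can be sketched as follows, assuming scikit-learn. This is an illustration rather than our final pipeline: the data is synthetic, the hyperparameter grids are hypothetical, the ensemble step is left out for brevity, and the airport codes in the K-means part are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the flight data (an assumption for illustration).
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "CRS_DEP_TIME": rng.integers(1, 2400, n),
    "DAY_OF_WEEK": rng.integers(1, 8, n),
})
# Toy rule: later scheduled departures are more likely to be delayed.
y = (X["CRS_DEP_TIME"] + rng.normal(0, 400, n) > 1500).astype(int)

# Candidate classifiers, each with a small hypothetical grid.
candidates = {
    "tree": (DecisionTreeClassifier(random_state=0),
             {"max_depth": [2, 4, 8]}),
    "svm": (make_pipeline(StandardScaler(), SVC()),
            {"svc__C": [0.1, 1.0, 10.0]}),
    "logreg": (make_pipeline(StandardScaler(),
                             LogisticRegression(max_iter=1000)),
               {"logisticregression__C": [0.1, 1.0, 10.0]}),
}

# 5-fold cross-validated grid search for every candidate.
results = {}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=5, scoring="accuracy")
    search.fit(X, y)
    results[name] = (search.best_score_, search.best_params_)

best_name = max(results, key=lambda k: results[k][0])
print(best_name, results[best_name])

# K-means on per-airport averages (made-up aggregates, placeholder codes).
airports = pd.DataFrame(
    {"mean_delay": [5.0, 6.0, 7.0, 40.0, 42.0, 45.0],
     "cancel_rate": [0.01, 0.02, 0.01, 0.08, 0.09, 0.10]},
    index=["AAA", "BBB", "CCC", "DDD", "EEE", "FFF"],
)
scaled = StandardScaler().fit_transform(airports)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
airports["cluster"] = kmeans.fit_predict(scaled)
print(airports)
```

Scaling sits inside the pipelines so that it is refit on each training fold, which keeps information from the validation fold out of the preprocessing step.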
Because the dataset records several types of delay, we are still working out how to narrow down and transform the existing features into the target feature we need, 'airline delay'.
For now, we plan to look for an external dataset, such as weather conditions, that we can join to the flight data to check whether weather influences a flight's on-time performance.
However, if we cannot find an appropriate dataset, we will change our question from 'predicting airline travel delays' to 'which airports/airlines are more likely to experience travel delays?'
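One possible way to derive targets from the existing columns is sketched below. It is only one reading of 'airline delay', not a settled definition. The derived column names (is_delayed, delay_minutes, carrier_caused) are hypothetical, the data is synthetic, and treating a missing cause value as zero minutes is an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic rows using the dataset's column names (values are made up).
df = pd.DataFrame({
    "DEP_DELAY_NEW": [0.0, 25.0, 90.0, np.nan],
    "DEP_DEL15": [0.0, 1.0, 1.0, np.nan],
    "CANCELLED": [0.0, 0.0, 0.0, 1.0],
    "CARRIER_DELAY": [np.nan, 20.0, 0.0, np.nan],
    "WEATHER_DELAY": [np.nan, 0.0, 60.0, np.nan],
})

# Keep flights that were not cancelled and have a delay indicator.
flown = df[(df["CANCELLED"] == 0) & df["DEP_DEL15"].notna()].copy()

# Classification target: delayed 15+ minutes or not.
flown["is_delayed"] = flown["DEP_DEL15"].astype(int)

# Regression target: delay in minutes, with early departures counted as 0.
flown["delay_minutes"] = flown["DEP_DELAY_NEW"]

# Alternative target: any delay minutes attributed to the carrier.
# Assumption: a missing cause value means no minutes were attributed.
flown["carrier_caused"] = (flown["CARRIER_DELAY"].fillna(0) > 0).astype(int)

print(flown[["is_delayed", "delay_minutes", "carrier_caused"]])
```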
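If we do find a weather dataset, the join could look roughly like the sketch below. The weather table is entirely hypothetical: its columns (ORIGIN, date, precip_mm) and ISO date format are assumptions, and the flight rows are synthetic. Only the FL_DATE format is taken from our data.

```python
import pandas as pd

# Synthetic flights using the dataset's column names and FL_DATE format.
flights = pd.DataFrame({
    "FL_DATE": ["8/1/2023 12:00:00 AM", "8/1/2023 12:00:00 AM",
                "8/2/2023 12:00:00 AM", "8/2/2023 12:00:00 AM"],
    "ORIGIN": ["RIC", "JFK", "RIC", "JFK"],
    "DEP_DEL15": [0.0, 1.0, 0.0, 1.0],
})

# Hypothetical daily weather per airport (made-up schema and values).
weather = pd.DataFrame({
    "ORIGIN": ["RIC", "JFK", "RIC", "JFK"],
    "date": ["2023-08-01", "2023-08-01", "2023-08-02", "2023-08-02"],
    "precip_mm": [0.0, 12.5, 0.0, 3.0],
})

# Build a matching ISO date key from FL_DATE.
parsed = pd.to_datetime(flights["FL_DATE"], format="%m/%d/%Y %I:%M:%S %p")
flights["date"] = parsed.dt.strftime("%Y-%m-%d")

# Left join keeps every flight, even when no weather record exists.
merged = flights.merge(weather, on=["ORIGIN", "date"], how="left")

# Compare the share of delayed flights on rainy and dry days.
merged["conditions"] = merged["precip_mm"].gt(0).map({True: "rain", False: "dry"})
delay_rate = merged.groupby("conditions")["DEP_DEL15"].mean()
print(delay_rate)
```

The toy numbers here say nothing about real flights. The sketch only shows the mechanics of aligning the two sources by airport and day.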
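The fallback question could be approached with a simple grouped summary, sketched below on synthetic rows. The helper delay_summary and its min_flights cutoff are hypothetical names introduced for illustration.

```python
import pandas as pd

# Synthetic flights using the dataset's column names (values are made up).
flights = pd.DataFrame({
    "ORIGIN": ["RIC", "RIC", "JFK", "JFK", "JFK", "CLT"],
    "OP_UNIQUE_CARRIER": ["9E", "9E", "9E", "AA", "AA", "AA"],
    "DEP_DEL15": [0.0, 0.0, 1.0, 1.0, 0.0, 1.0],
    "DEP_DELAY_NEW": [0.0, 0.0, 30.0, 60.0, 0.0, 20.0],
})

def delay_summary(df, key, min_flights=1):
    """Rank the groups in `key` by their share of flights delayed 15+ minutes."""
    summary = df.groupby(key).agg(
        flights=("DEP_DEL15", "size"),
        delay_rate=("DEP_DEL15", "mean"),
        mean_delay_min=("DEP_DELAY_NEW", "mean"),
    )
    # Drop groups with too few flights for the rate to be meaningful.
    summary = summary[summary["flights"] >= min_flights]
    return summary.sort_values("delay_rate", ascending=False)

by_airport = delay_summary(flights, "ORIGIN")
by_carrier = delay_summary(flights, "OP_UNIQUE_CARRIER")
print(by_airport)
print(by_carrier)
```

A minimum flight count matters on the real data because an airport with very few departures can top the ranking by chance.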