STAT 451 Proposal: Predicting and Classifying Air Quality Index (AQI) Using Global Weather Data
*Group 18:* Tan Bui, Lacey Dinh, Diego Ugaz, Maddie Young, Minji Suh
We all want to start a brand new day with good weather, but how do we quantify that "The air here is so fresh" feeling? We plan to investigate how meteorological and environmental features contribute to air quality, and how they can be used to estimate the air quality of a particular region.
Dataset
For this project, we will use the Global Weather Repository data set from Kaggle. This data set includes over 40 features, such as temperature, wind speed, pressure, precipitation, humidity, and visibility, along with measurements of airborne pollutants such as Ozone, Sulfur Dioxide, and Nitrogen Dioxide.
This data set is helpful for analyzing global weather patterns and understanding the impacts of various environmental factors on air quality.
Data Exploration
We will first load and explore the data set to understand its structure and identify the key variables that will help answer our questions.
import pandas as pd
from IPython.display import display
df = pd.read_csv('GlobalWeatherRepository.csv')
display(df.head(n=10))
print(f'The Global Weather Repository data set has {df.shape[1]} features and {df.shape[0]} examples.')
| | country | location_name | latitude | longitude | timezone | last_updated_epoch | last_updated | temperature_celsius | temperature_fahrenheit | condition_text | ... | air_quality_PM2.5 | air_quality_PM10 | air_quality_us-epa-index | air_quality_gb-defra-index | sunrise | sunset | moonrise | moonset | moon_phase | moon_illumination |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | Kabul | 34.52 | 69.18 | Asia/Kabul | 1715849100 | 2024-05-16 13:15 | 26.6 | 79.8 | Partly Cloudy | ... | 8.4 | 26.6 | 1 | 1 | 04:50 AM | 06:50 PM | 12:12 PM | 01:11 AM | Waxing Gibbous | 55 |
1 | Albania | Tirana | 41.33 | 19.82 | Europe/Tirane | 1715849100 | 2024-05-16 10:45 | 19.0 | 66.2 | Partly cloudy | ... | 1.1 | 2.0 | 1 | 1 | 05:21 AM | 07:54 PM | 12:58 PM | 02:14 AM | Waxing Gibbous | 55 |
2 | Algeria | Algiers | 36.76 | 3.05 | Africa/Algiers | 1715849100 | 2024-05-16 09:45 | 23.0 | 73.4 | Sunny | ... | 10.4 | 18.4 | 1 | 1 | 05:40 AM | 07:50 PM | 01:15 PM | 02:14 AM | Waxing Gibbous | 55 |
3 | Andorra | Andorra La Vella | 42.50 | 1.52 | Europe/Andorra | 1715849100 | 2024-05-16 10:45 | 6.3 | 43.3 | Light drizzle | ... | 0.7 | 0.9 | 1 | 1 | 06:31 AM | 09:11 PM | 02:12 PM | 03:31 AM | Waxing Gibbous | 55 |
4 | Angola | Luanda | -8.84 | 13.23 | Africa/Luanda | 1715849100 | 2024-05-16 09:45 | 26.0 | 78.8 | Partly cloudy | ... | 183.4 | 262.3 | 5 | 10 | 06:12 AM | 05:55 PM | 01:17 PM | 12:38 AM | Waxing Gibbous | 55 |
5 | Antigua and Barbuda | Saint John's | 17.12 | -61.85 | America/Antigua | 1715849100 | 2024-05-16 04:45 | 26.0 | 78.8 | Partly cloudy | ... | 1.2 | 4.5 | 1 | 1 | 05:36 AM | 06:32 PM | 01:05 PM | 01:14 AM | Waxing Gibbous | 55 |
6 | Argentina | Buenos Aires | -34.59 | -58.67 | America/Argentina/Buenos_Aires | 1715849100 | 2024-05-16 05:45 | 8.0 | 46.4 | Clear | ... | 4.0 | 5.3 | 1 | 1 | 07:43 AM | 05:59 PM | 02:36 PM | 01:04 AM | Waxing Gibbous | 55 |
7 | Armenia | Yerevan | 40.18 | 44.51 | Asia/Yerevan | 1715849100 | 2024-05-16 12:45 | 19.0 | 66.2 | Partly cloudy | ... | 0.8 | 0.9 | 1 | 1 | 05:45 AM | 08:12 PM | 01:17 PM | 02:31 AM | Waxing Gibbous | 55 |
8 | Australia | Canberra | -35.28 | 149.22 | Australia/Sydney | 1715849100 | 2024-05-16 18:45 | 9.0 | 48.2 | Clear | ... | 3.7 | 5.4 | 1 | 1 | 06:52 AM | 05:07 PM | 01:31 PM | No moonset | Waxing Gibbous | 55 |
9 | Austria | Vienna | 48.20 | 16.37 | Europe/Vienna | 1715849100 | 2024-05-16 10:45 | 16.0 | 60.8 | Partly cloudy | ... | 3.7 | 4.4 | 1 | 1 | 05:14 AM | 08:29 PM | 01:00 PM | 02:42 AM | Waxing Gibbous | 55 |
10 rows × 41 columns
The Global Weather Repository data set has 41 features and 34914 examples.
display(df.columns) # list of variables
Index(['country', 'location_name', 'latitude', 'longitude', 'timezone', 'last_updated_epoch', 'last_updated', 'temperature_celsius', 'temperature_fahrenheit', 'condition_text', 'wind_mph', 'wind_kph', 'wind_degree', 'wind_direction', 'pressure_mb', 'pressure_in', 'precip_mm', 'precip_in', 'humidity', 'cloud', 'feels_like_celsius', 'feels_like_fahrenheit', 'visibility_km', 'visibility_miles', 'uv_index', 'gust_mph', 'gust_kph', 'air_quality_Carbon_Monoxide', 'air_quality_Ozone', 'air_quality_Nitrogen_dioxide', 'air_quality_Sulphur_dioxide', 'air_quality_PM2.5', 'air_quality_PM10', 'air_quality_us-epa-index', 'air_quality_gb-defra-index', 'sunrise', 'sunset', 'moonrise', 'moonset', 'moon_phase', 'moon_illumination'], dtype='object')
The provided variables can be categorized into one of the following groups:
- Time and Location: e.g. `latitude`, `timezone`
- Weather: e.g. `temperature_celsius`, `wind_mph`, `precip_mm`. Note that one feature may be listed as multiple variables with different units.
- Pollutant measurement: e.g. `air_quality_Carbon_Monoxide`, `air_quality_PM2.5`
- Astronomy: e.g. `sunrise`, `moon_phase`
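Because some measurements appear twice in different units, one option is to confirm that a pair really is a unit conversion and then keep only one column. The following is a minimal sketch under stated assumptions: the four rows are copied from the preview above, the helper `drop_redundant_units` is a hypothetical name, and the 0.1 tolerance is an assumed allowance for rounding or truncation in the source.

```python
import numpy as np
import pandas as pd

# First four rows of the two temperature columns, copied from the preview above.
sample = pd.DataFrame({
    "temperature_celsius": [26.6, 19.0, 23.0, 6.3],
    "temperature_fahrenheit": [79.8, 66.2, 73.4, 43.3],
})

def drop_redundant_units(frame, keep, drop, convert, atol=0.1):
    """Drop `drop` if it equals convert(frame[keep]) within `atol` (hypothetical helper)."""
    consistent = np.isclose(convert(frame[keep]), frame[drop], atol=atol).all()
    if not consistent:
        raise ValueError(f"{drop} is not a unit conversion of {keep}")
    return frame.drop(columns=[drop])

# Celsius -> Fahrenheit; the tolerance absorbs one-decimal rounding or truncation.
deduped = drop_redundant_units(
    sample,
    keep="temperature_celsius",
    drop="temperature_fahrenheit",
    convert=lambda c: c * 9 / 5 + 32,
)
print(deduped.columns.tolist())
```

The same check could be repeated for the other duplicated pairs (for example `wind_mph` and `wind_kph`) with the appropriate conversion function.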
Questions
To more accurately analyze how air quality relates to environmental features, we plan to focus on a few specific questions.
- Which weather factors contribute most to the AQI level?
  - Our key variables will be temperature, humidity, wind speed, visibility, and pollutant levels (e.g. Ozone, Nitrogen Dioxide). In the list above, they mainly belong to the second and third groups.
  - Our main approach will be regression, using linear regression, random forest, and gradient boosting.
- How do we classify AQI levels based on weather and pollutant data?
  - Our key variables will be temperature, humidity, visibility, and pollutant levels.
  - Our main approach will be classification, using logistic regression, k-NN, SVM, and random forest.
- Do regions with similar climate patterns (e.g. temperature and precipitation) have similar air quality profiles?
  - Our key variables will be location data, temperature, precipitation, and humidity. Variables in the first group will also be used for this question.
  - Our main approach will be clustering, e.g. with k-means.
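The clustering idea behind the third question can be illustrated with k-means, chosen here as one common clustering algorithm. This is only a sketch under stated assumptions: the data are synthetic (two invented climate types, not rows from the Kaggle file), k = 2 is known in advance for this toy case, and scikit-learn is assumed to be installed.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for two climate types (NOT the real data):
# 50 hot, dry locations followed by 50 cool, wet locations.
hot_dry = pd.DataFrame({
    "temperature_celsius": rng.normal(35, 2, 50),
    "precip_mm": rng.uniform(0.0, 0.2, 50),
    "humidity": rng.normal(20, 5, 50),
})
cool_wet = pd.DataFrame({
    "temperature_celsius": rng.normal(10, 2, 50),
    "precip_mm": rng.uniform(3.0, 6.0, 50),
    "humidity": rng.normal(85, 5, 50),
})
climate = pd.concat([hot_dry, cool_wet], ignore_index=True)

# Standardize first so that no single unit dominates the Euclidean distance.
scaled = StandardScaler().fit_transform(climate)

# k = 2 is known for this toy data; on the real data it would have to be tuned.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
climate["cluster"] = kmeans.fit_predict(scaled)

# Per-cluster feature means give a quick description of each climate group.
print(climate.groupby("cluster").mean().round(1))
```

On the real data, the resulting cluster labels could then be compared against the AQI columns to see whether climate groups differ in air quality.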
Key Variables Summary
Having identified the key variables, we check for missing data and compute a brief statistical summary.
# Key features for analysis based on research questions
key_features = [
    'temperature_celsius', 'humidity', 'wind_kph', 'pressure_mb', 'visibility_km',
    'air_quality_Ozone', 'air_quality_Nitrogen_dioxide', 'air_quality_Sulphur_dioxide',
    'air_quality_PM2.5', 'air_quality_PM10', 'air_quality_us-epa-index',
    'latitude', 'longitude', 'precip_mm'
]
# Filter the dataset to include only relevant columns
df_filtered = df[key_features].copy()
# Overview of missing values
print("Missing Values:")
display(df_filtered.isnull().sum())
Missing Values:
temperature_celsius             0
humidity                        0
wind_kph                        0
pressure_mb                     0
visibility_km                   0
air_quality_Ozone               0
air_quality_Nitrogen_dioxide    0
air_quality_Sulphur_dioxide     0
air_quality_PM2.5               0
air_quality_PM10                0
air_quality_us-epa-index        0
latitude                        0
longitude                       0
precip_mm                       0
dtype: int64
# Brief statistical summary
print("Statistical Summary:")
display(df_filtered.describe())
Statistical Summary:
| | temperature_celsius | humidity | wind_kph | pressure_mb | visibility_km | air_quality_Ozone | air_quality_Nitrogen_dioxide | air_quality_Sulphur_dioxide | air_quality_PM2.5 | air_quality_PM10 | air_quality_us-epa-index | latitude | longitude | precip_mm |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 34914.000000 | 34914.000000 | 34914.000000 | 34914.000000 | 34914.000000 | 34914.000000 | 34914.00000 | 34914.000000 | 34914.000000 | 34914.000000 | 34914.000000 | 34914.000000 | 34914.000000 | 34914.000000 |
mean | 25.112602 | 61.753623 | 13.545772 | 1012.966604 | 9.742327 | 63.257607 | 11.34178 | 8.109656 | 19.416344 | 36.604773 | 1.505929 | 19.142624 | 22.148852 | 0.155504 |
std | 7.491479 | 24.865681 | 17.936863 | 6.540543 | 2.339220 | 40.919080 | 23.47284 | 56.927274 | 46.464155 | 78.482634 | 0.858093 | 24.489479 | 65.782518 | 0.678246 |
min | -12.100000 | 2.000000 | 3.600000 | 971.000000 | 0.000000 | 0.000000 | 0.00000 | -9999.000000 | 0.370000 | 0.500000 | 1.000000 | -41.300000 | -175.200000 | 0.000000 |
25% | 21.200000 | 43.000000 | 6.800000 | 1010.000000 | 10.000000 | 34.700000 | 0.70000 | 0.500000 | 3.300000 | 5.900000 | 1.000000 | 3.750000 | -6.836100 | 0.000000 |
50% | 26.300000 | 66.000000 | 11.900000 | 1013.000000 | 10.000000 | 58.000000 | 2.30000 | 1.665000 | 9.250000 | 15.725000 | 1.000000 | 17.250000 | 23.320000 | 0.000000 |
75% | 29.300000 | 82.000000 | 19.100000 | 1016.000000 | 10.000000 | 86.000000 | 9.60000 | 6.105000 | 20.905000 | 36.630000 | 2.000000 | 40.400000 | 50.580000 | 0.030000 |
max | 49.200000 | 100.000000 | 2963.200000 | 1045.000000 | 32.000000 | 480.700000 | 427.70000 | 294.705000 | 1614.100000 | 1814.400000 | 6.000000 | 64.150000 | 179.220000 | 42.240000 |
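Two values in this summary look suspicious: the minimum of `air_quality_Sulphur_dioxide` is -9999, which appears to be a placeholder for a missing reading, and the maximum of `wind_kph` is 2963.2. Because -9999 is an ordinary number, `isnull()` does not count it as missing. The sketch below shows one way such values could be masked before modeling. It runs on toy rows rather than the real file, and both the interpretation of -9999 as a sentinel and the 500 kph cap are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows (not the real data) mixing plausible readings with the two
# suspicious values seen in the summary: -9999 for SO2 and 2963.2 kph wind.
toy = pd.DataFrame({
    "air_quality_Sulphur_dioxide": [0.5, -9999.0, 6.1, 1.7],
    "wind_kph": [6.8, 11.9, 2963.2, 19.1],
})

SENTINEL = -9999      # assumed missing-value code
MAX_WIND_KPH = 500    # hypothetical plausibility cap, not taken from the data set

# Treat the sentinel as missing so it no longer distorts the mean and std.
cleaned = toy.replace(SENTINEL, np.nan)
# Mask implausibly high wind speeds as missing as well.
cleaned["wind_kph"] = cleaned["wind_kph"].where(cleaned["wind_kph"] <= MAX_WIND_KPH)

print(cleaned.isna().sum())
```

Whether masked rows are then dropped or imputed is a separate modeling decision.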
# Correlation matrix for predictive feature selection and interaction analysis
print("Correlation Matrix (Environmental Factors & AQI):")
display(df_filtered.corr())
Correlation Matrix (Environmental Factors & AQI):
| | temperature_celsius | humidity | wind_kph | pressure_mb | visibility_km | air_quality_Ozone | air_quality_Nitrogen_dioxide | air_quality_Sulphur_dioxide | air_quality_PM2.5 | air_quality_PM10 | air_quality_us-epa-index | latitude | longitude | precip_mm |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
temperature_celsius | 1.000000 | -0.438570 | 0.053580 | -0.509654 | 0.051705 | 0.342989 | -0.107733 | -0.004718 | -0.043648 | 0.095873 | 0.048837 | -0.046598 | 0.146377 | -0.031297 |
humidity | -0.438570 | 1.000000 | -0.069026 | 0.055750 | -0.071507 | -0.527187 | 0.027391 | -0.020066 | -0.014249 | -0.129234 | -0.076155 | -0.146162 | -0.141501 | 0.190326 |
wind_kph | 0.053580 | -0.069026 | 1.000000 | -0.067979 | 0.014007 | 0.079195 | -0.077407 | -0.022106 | -0.045635 | -0.003391 | -0.057390 | 0.023856 | 0.019181 | -0.003366 |
pressure_mb | -0.509654 | 0.055750 | -0.067979 | 1.000000 | 0.017664 | -0.234029 | 0.014400 | 0.001259 | -0.010775 | -0.090576 | -0.052234 | -0.096744 | -0.219051 | -0.084674 |
visibility_km | 0.051705 | -0.071507 | 0.014007 | 0.017664 | 1.000000 | -0.017740 | -0.077513 | -0.024182 | -0.109708 | -0.089347 | -0.112055 | 0.007173 | 0.129837 | -0.053807 |
air_quality_Ozone | 0.342989 | -0.527187 | 0.079195 | -0.234029 | -0.017740 | 1.000000 | -0.178857 | -0.004617 | -0.000918 | 0.093300 | 0.090643 | 0.254975 | 0.067081 | -0.100028 |
air_quality_Nitrogen_dioxide | -0.107733 | 0.027391 | -0.077407 | 0.014400 | -0.077513 | -0.178857 | 1.000000 | 0.248641 | 0.534388 | 0.422084 | 0.608713 | 0.099425 | 0.143274 | -0.005762 |
air_quality_Sulphur_dioxide | -0.004718 | -0.020066 | -0.022106 | 0.001259 | -0.024182 | -0.004617 | 0.248641 | 1.000000 | 0.162848 | 0.150653 | 0.202968 | 0.013314 | 0.032617 | -0.006715 |
air_quality_PM2.5 | -0.043648 | -0.014249 | -0.045635 | -0.010775 | -0.109708 | -0.000918 | 0.534388 | 0.162848 | 1.000000 | 0.818482 | 0.710839 | -0.094156 | 0.027012 | -0.021123 |
air_quality_PM10 | 0.095873 | -0.129234 | -0.003391 | -0.090576 | -0.089347 | 0.093300 | 0.422084 | 0.150653 | 0.818482 | 1.000000 | 0.716870 | -0.058033 | 0.045148 | -0.040515 |
air_quality_us-epa-index | 0.048837 | -0.076155 | -0.057390 | -0.052234 | -0.112055 | 0.090643 | 0.608713 | 0.202968 | 0.710839 | 0.716870 | 1.000000 | -0.038416 | 0.102747 | -0.042637 |
latitude | -0.046598 | -0.146162 | 0.023856 | -0.096744 | 0.007173 | 0.254975 | 0.099425 | 0.013314 | -0.094156 | -0.058033 | -0.038416 | 1.000000 | -0.020518 | -0.020348 |
longitude | 0.146377 | -0.141501 | 0.019181 | -0.219051 | 0.129837 | 0.067081 | 0.143274 | 0.032617 | 0.027012 | 0.045148 | 0.102747 | -0.020518 | 1.000000 | 0.040331 |
precip_mm | -0.031297 | 0.190326 | -0.003366 | -0.084674 | -0.053807 | -0.100028 | -0.005762 | -0.006715 | -0.021123 | -0.040515 | -0.042637 | -0.020348 | 0.040331 | 1.000000 |
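In the matrix above, `air_quality_PM2.5` and `air_quality_PM10` are strongly correlated (about 0.818), which can matter for linear models. A small helper can list such pairs automatically. The sketch below is an illustration under stated assumptions: it uses a 3x3 excerpt of the matrix with rounded values, `strong_pairs` is a hypothetical helper name, and the 0.7 threshold is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# A 3x3 excerpt of the correlation matrix above (values rounded to 3 decimals).
cols = ["air_quality_PM2.5", "air_quality_PM10", "humidity"]
corr = pd.DataFrame(
    [[1.000, 0.818, -0.014],
     [0.818, 1.000, -0.129],
     [-0.014, -0.129, 1.000]],
    index=cols, columns=cols,
)

def strong_pairs(corr_matrix, threshold=0.7):
    """Return (feature_a, feature_b, r) for pairs with |r| >= threshold."""
    # Keep the strict upper triangle so each pair appears once, without the diagonal.
    upper = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
    stacked = corr_matrix.where(upper).stack()
    # NaN entries (masked cells) compare as False and are filtered out here.
    selected = stacked[stacked.abs() >= threshold]
    return [(a, b, r) for (a, b), r in selected.items()]

print(strong_pairs(corr))
```

Applied to the full matrix, the same function would be called as `strong_pairs(df_filtered.corr())`.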
For the second question (classification), we also list the AQI levels present in the data, which range from 1 to 6.
# Unique AQI categories for classification modeling
print("Unique AQI Levels (US EPA Index):", df_filtered['air_quality_us-epa-index'].sort_values().unique())
Unique AQI Levels (US EPA Index): [1 2 3 4 5 6]
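The statistical summary shows a median of 1 and a 75th percentile of 2 for `air_quality_us-epa-index`, so at least half of the rows are at level 1 and the classes are likely imbalanced. Checking the share of each level gives a baseline that any classifier should beat. The sketch below uses made-up labels, not the real distribution, purely to show the calculation.

```python
import pandas as pd

# Toy labels (not the real distribution), skewed toward low AQI levels.
aqi = pd.Series([1, 1, 1, 1, 1, 1, 2, 2, 3, 5], name="air_quality_us-epa-index")

# Share of each level, ordered by level rather than by frequency.
shares = aqi.value_counts(normalize=True).sort_index()
print(shares)

# Accuracy of always predicting the most common level: a baseline to beat.
majority_baseline = shares.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

On the real data, `aqi` would be replaced by `df_filtered['air_quality_us-epa-index']`.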
# Concise summaries relevant to each research question
summary = {
    "AQI Prediction": df_filtered[['temperature_celsius', 'humidity', 'wind_kph', 'pressure_mb',
                                   'visibility_km', 'air_quality_Ozone', 'air_quality_Nitrogen_dioxide']].corrwith(
        df_filtered['air_quality_us-epa-index']),
    "AQI Classification": df_filtered[['temperature_celsius', 'humidity', 'visibility_km',
                                       'air_quality_Ozone', 'air_quality_Nitrogen_dioxide']].apply(pd.Series.nunique),
    "Geographic Clustering": df_filtered[['latitude', 'longitude', 'temperature_celsius',
                                          'precip_mm', 'humidity']].describe(),
}
print("Summary Insights:\n")
for question, insight in summary.items():
    print(question)
    display(insight)
    print()
Summary Insights:

AQI Prediction
temperature_celsius             0.048837
humidity                       -0.076155
wind_kph                       -0.057390
pressure_mb                    -0.052234
visibility_km                  -0.112055
air_quality_Ozone               0.090643
air_quality_Nitrogen_dioxide    0.608713
dtype: float64
AQI Classification
temperature_celsius              516
humidity                          99
visibility_km                     62
air_quality_Ozone                622
air_quality_Nitrogen_dioxide    1186
dtype: int64
Geographic Clustering
| | latitude | longitude | temperature_celsius | precip_mm | humidity |
---|---|---|---|---|---|
count | 34914.000000 | 34914.000000 | 34914.000000 | 34914.000000 | 34914.000000 |
mean | 19.142624 | 22.148852 | 25.112602 | 0.155504 | 61.753623 |
std | 24.489479 | 65.782518 | 7.491479 | 0.678246 | 24.865681 |
min | -41.300000 | -175.200000 | -12.100000 | 0.000000 | 2.000000 |
25% | 3.750000 | -6.836100 | 21.200000 | 0.000000 | 43.000000 |
50% | 17.250000 | 23.320000 | 26.300000 | 0.000000 | 66.000000 |
75% | 40.400000 | 50.580000 | 29.300000 | 0.030000 | 82.000000 |
max | 64.150000 | 179.220000 | 49.200000 | 42.240000 | 100.000000 |