Introduction
This document provides a series of word problems designed to help you
practice using the dplyr package with the flights dataset. The problems
range from simple tasks that require a single function to more complex
tasks that require piping (%>%) together multiple functions. All
questions refer to the nycflights13 data.
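Before starting, it may help to load the packages and look at the
columns. The following is a minimal setup sketch, assuming dplyr and
nycflights13 are already installed; the name jfk_dest_counts is only
illustrative and is not part of the exercises.

```r
library(dplyr)
library(nycflights13)

# Inspect column names and types before attempting the problems
glimpse(flights)

# A small piped example: count JFK departures per destination
jfk_dest_counts <- flights %>%
  filter(origin == "JFK") %>%
  count(dest, sort = TRUE)

head(jfk_dest_counts)
```

Each step in the pipe receives the result of the previous step as its
first argument, which is the pattern the later problems build on.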
Easy Problems
- Filter: Find all flights that departed before 6 AM
on January 1.
- Mutate: Add a new column that shows the flight
duration in hours (assume air_time is in minutes).
- Select: Retrieve only the columns year, month, day, air_time, and
distance for February flights.
- Distinct: Find all unique airlines in the
dataset.
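One possible approach to the easy problems is sketched below. It
assumes that dep_time is stored as an integer in HHMM form (so 6 AM is
600) and that "airlines" means the carrier column. The result names
are illustrative only.

```r
library(dplyr)
library(nycflights13)

# Filter: flights on January 1 that departed before 6 AM
# (rows with a missing dep_time are dropped by filter())
early_jan1 <- flights %>%
  filter(month == 1, day == 1, dep_time < 600)

# Mutate: air_time is in minutes, so divide by 60 for hours
with_hours <- flights %>%
  mutate(air_time_hours = air_time / 60)

# Select: keep five columns for February flights
feb_cols <- flights %>%
  filter(month == 2) %>%
  select(year, month, day, air_time, distance)

# Distinct: one row per carrier code
carriers <- flights %>%
  distinct(carrier)
```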
Medium Problems
- Group By and Summarize: Calculate the average delay
by carrier. Consider both arr_delay and dep_delay.
- Slice: Select the first 3 flights of each day in
January.
- Mutate and Filter: Add a column for the speed
(distance divided by air time in hours) of each flight and then find
flights that had an average speed lower than 300 miles per hour.
- Select and Distinct: Find distinct origin and
destination airport pairs.
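The medium problems could be approached along the following lines.
This is a sketch rather than a reference solution: it assumes missing
delays should be ignored when averaging, and that "first 3 flights"
means the first three rows per day in the table's existing order.

```r
library(dplyr)
library(nycflights13)

# Group By and Summarize: mean departure and arrival delay per carrier
avg_delay_by_carrier <- flights %>%
  group_by(carrier) %>%
  summarize(
    avg_dep_delay = mean(dep_delay, na.rm = TRUE),
    avg_arr_delay = mean(arr_delay, na.rm = TRUE)
  )

# Slice: first three rows of each January day (row order is an assumption)
first3_jan <- flights %>%
  filter(month == 1) %>%
  group_by(day) %>%
  slice(1:3) %>%
  ungroup()

# Mutate and Filter: speed in miles per hour, then keep the slow flights
slow_flights <- flights %>%
  mutate(speed_mph = distance / (air_time / 60)) %>%
  filter(speed_mph < 300)

# Select and Distinct: unique origin and destination pairs
routes <- flights %>%
  select(origin, dest) %>%
  distinct()
```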
Challenging Problems
- Group By, Summarize, and Filter: Find days with
more than 100 flights delayed by over 30 minutes. Display the date and
the number of such flights.
- Mutate, Filter, and Arrange: Calculate the total
delay (sum of dep_delay and arr_delay) for
each flight, then find the top 10 most delayed flights.
- Group By, Summarize, and Mutate: For each carrier,
calculate the total number of flights and the average delay. Then, add a
column ranking carriers by the number of flights.
- Select, Group By, Summarize, and Slice: Find the
day with the highest average departure delay for each month. Select only
the relevant columns to display.
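A possible way to tackle the challenging problems is sketched below.
Several choices here are assumptions, not requirements of the
problems: "delayed by over 30 minutes" is read as dep_delay > 30, the
average delay per carrier uses dep_delay, and flights with a missing
delay are excluded from the top 10.

```r
library(dplyr)
library(nycflights13)

# Group By, Summarize, and Filter: days with more than 100 late departures
busy_delay_days <- flights %>%
  group_by(year, month, day) %>%
  summarize(n_delayed = sum(dep_delay > 30, na.rm = TRUE), .groups = "drop") %>%
  filter(n_delayed > 100)

# Mutate, Filter, and Arrange: ten largest combined delays
top10_delayed <- flights %>%
  mutate(total_delay = dep_delay + arr_delay) %>%
  filter(!is.na(total_delay)) %>%
  arrange(desc(total_delay)) %>%
  slice(1:10)

# Group By, Summarize, and Mutate: rank 1 is the carrier with the most flights
carrier_summary <- flights %>%
  group_by(carrier) %>%
  summarize(
    n_flights = n(),
    avg_dep_delay = mean(dep_delay, na.rm = TRUE)
  ) %>%
  mutate(flight_rank = min_rank(desc(n_flights)))

# Select, Group By, Summarize, and Slice: worst day in each month
worst_day_per_month <- flights %>%
  select(month, day, dep_delay) %>%
  group_by(month, day) %>%
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE), .groups = "drop") %>%
  group_by(month) %>%
  slice_max(avg_dep_delay, n = 1, with_ties = FALSE) %>%
  ungroup()
```

The last pipeline uses slice_max() rather than plain slice(); arranging
by descending average and then calling slice(1) on the grouped data
would be an equivalent alternative.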
Complex Problems
- Group By, Summarize, Filter, and Arrange: Identify
the month with the highest average air time for flights departing from
JFK. Then, list the average air times in descending order.
- Mutate, Group By, Summarize, and Filter: For each
destination, calculate the average air time and the total number of
flights. Then, find destinations where the average air time is more than
3 hours and there are at least 50 flights heading there.
- Distinct, Group By, and Summarize: Identify unique
routes (combination of origin and destination)
and calculate the average flight duration for each. Then, find routes
where the average flight duration is less than 2 hours.
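The complex problems might be approached as in the sketch below. It
assumes "flight duration" means air_time and that missing air times
are ignored when averaging; the result names are illustrative.

```r
library(dplyr)
library(nycflights13)

# Group By, Summarize, Filter, and Arrange: monthly average air time from JFK.
# After sorting, the first row is the month with the highest average.
jfk_air_time <- flights %>%
  filter(origin == "JFK") %>%
  group_by(month) %>%
  summarize(avg_air_time = mean(air_time, na.rm = TRUE)) %>%
  arrange(desc(avg_air_time))

# Mutate, Group By, Summarize, and Filter: long-haul, frequently served destinations
long_popular_dests <- flights %>%
  mutate(air_time_hours = air_time / 60) %>%
  group_by(dest) %>%
  summarize(
    avg_air_time_hours = mean(air_time_hours, na.rm = TRUE),
    n_flights = n()
  ) %>%
  filter(avg_air_time_hours > 3, n_flights >= 50)

# Distinct, Group By, and Summarize: grouping by origin and dest already
# yields one row per unique route, so no separate distinct() call is needed here.
short_routes <- flights %>%
  group_by(origin, dest) %>%
  summarize(avg_air_time = mean(air_time, na.rm = TRUE), .groups = "drop") %>%
  filter(avg_air_time < 120)
```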
Conclusion
These problems are designed to enhance your proficiency with dplyr
functions such as filter(), mutate(), select(), distinct(), group_by(),
summarize(), arrange(), slice(), and ungroup(). By working through these
exercises, you will become more adept at manipulating and analyzing
data within the R programming environment.