Introduction

This document provides a series of word problems designed to help you practice using the dplyr package with the flights dataset. The problems range from simple tasks that require a single function to more complex tasks that pipe (%>%) several functions together. All questions refer to the flights table in the nycflights13 package.

Easy Problems

  1. Filter: Find all flights that departed before 6 AM on January 1. (dep_time is stored as an integer in HHMM form, so 6 AM is 600.)
  2. Mutate: Add a new column that shows the flight duration in hours (air_time is recorded in minutes).
  3. Select: Retrieve only the columns year, month, day, air_time, and distance for February flights.
  4. Distinct: Find all unique airlines in the dataset.
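As a warm-up, here is one possible approach to the first easy problem. It is a sketch rather than the only valid answer; the object name early_jan1 and the columns chosen for display are arbitrary choices made for illustration.

```r
library(dplyr)
library(nycflights13)

# dep_time is an integer in HHMM form (e.g. 517 means 5:17 AM),
# so "before 6 AM" translates to dep_time < 600.
# filter() also drops rows where dep_time is NA (cancelled flights).
early_jan1 <- flights %>%
  filter(month == 1, day == 1, dep_time < 600)

# Show a few identifying columns for the matching flights.
early_jan1 %>%
  select(month, day, dep_time, carrier, flight, origin, dest)
```

Separating conditions with commas inside filter() combines them with a logical AND, which avoids chaining several filter() calls.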

Medium Problems

  1. Group By and Summarize: Calculate the average departure delay (dep_delay) and the average arrival delay (arr_delay) for each carrier.
  2. Slice: Select the first 3 flights of each day in January.
  3. Mutate and Filter: Add a column for the speed (distance divided by air time in hours) of each flight and then find flights that had an average speed lower than 300 miles per hour.
  4. Select and Distinct: Find distinct origin and destination airport pairs.
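The first medium problem can be approached roughly as follows. This is a minimal sketch: the output column names (avg_dep_delay, avg_arr_delay, n_flights) are made up for illustration, and dropping missing delays with na.rm = TRUE is an assumption about how cancelled flights should be handled.

```r
library(dplyr)
library(nycflights13)

# One row per carrier, with the mean of each delay column.
# Cancelled flights have NA delays, so na.rm = TRUE is needed
# to avoid every mean coming back as NA.
delay_by_carrier <- flights %>%
  group_by(carrier) %>%
  summarize(
    avg_dep_delay = mean(dep_delay, na.rm = TRUE),
    avg_arr_delay = mean(arr_delay, na.rm = TRUE),
    n_flights     = n()
  )

delay_by_carrier
```

Keeping the flight count next to the averages makes it easier to spot carriers whose average rests on only a handful of flights.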

Challenging Problems

  1. Group By, Summarize, and Filter: Find days with more than 100 flights delayed by over 30 minutes (use dep_delay). Display the date and the number of such flights.
  2. Mutate, Filter, and Arrange: Calculate the total delay (sum of dep_delay and arr_delay) for each flight, then find the top 10 most delayed flights.
  3. Group By, Summarize, and Mutate: For each carrier, calculate the total number of flights and the average delay. Then, add a column ranking carriers by the number of flights.
  4. Select, Group By, Summarize, and Slice: Find the day with the highest average departure delay for each month. Select only the relevant columns to display.
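For the last challenging problem, one way to chain the four steps is sketched below. It assumes dplyr 1.0.0 or later for slice_max() and the .groups argument; the name worst_day_per_month is hypothetical, and breaking ties with with_ties = FALSE is a choice made here so that each month yields exactly one row.

```r
library(dplyr)
library(nycflights13)

worst_day_per_month <- flights %>%
  # Keep only the columns the calculation needs.
  select(month, day, dep_delay) %>%
  group_by(month, day) %>%
  # Average per day; "drop_last" leaves the result grouped by month.
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE),
            .groups = "drop_last") %>%
  # Within each month, keep the single day with the largest average.
  slice_max(avg_dep_delay, n = 1, with_ties = FALSE) %>%
  ungroup()

worst_day_per_month
```

The final ungroup() returns an ordinary tibble, so later steps are not silently computed per month.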

Complex Problems

  1. Group By, Summarize, Filter, and Arrange: For flights departing from JFK, calculate the average air time for each month and list the months in descending order of average air time. The first row is the month with the highest average.
  2. Mutate, Group By, Summarize, and Filter: For each destination, calculate the average air time and the total number of flights. Then, find destinations where the average air time is more than 3 hours and there are at least 50 flights heading there.
  3. Distinct, Group By, Summarize, and Filter: Identify unique routes (combination of origin and destination) and calculate the average flight duration for each. Then, find routes where the average flight duration is less than 2 hours.
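The second complex problem could be tackled along these lines. Treat it as a sketch under stated assumptions: the names long_haul, air_time_hours, avg_air_time_hours, and n_flights are invented for illustration, n() counts every scheduled flight including cancelled ones, and the final arrange() is an optional extra that the problem does not ask for.

```r
library(dplyr)
library(nycflights13)

long_haul <- flights %>%
  # air_time is in minutes; convert to hours first.
  mutate(air_time_hours = air_time / 60) %>%
  group_by(dest) %>%
  summarize(
    avg_air_time_hours = mean(air_time_hours, na.rm = TRUE),
    n_flights          = n()
  ) %>%
  # Keep destinations averaging over 3 hours with at least 50 flights.
  filter(avg_air_time_hours > 3, n_flights >= 50) %>%
  arrange(desc(avg_air_time_hours))

long_haul
```

Filtering after summarize() applies the conditions to the per-destination summaries rather than to individual flights, which is the distinction this problem is designed to practice.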

Conclusion

These problems are designed to enhance your proficiency with dplyr functions such as filter(), mutate(), select(), distinct(), group_by(), summarize(), slice(), arrange(), and ungroup(). By working through these exercises, you will become more adept at manipulating and analyzing data within the R programming environment.