dplyr verbs

Introduction

dplyr is a powerful package in R for data manipulation. This presentation covers the primary dplyr verbs and illustrates with some examples.

Common Characteristics of dplyr Verbs

First argument is always a data frame (or tibble).
Subsequent arguments specify columns to operate on.
The output is a new data frame.

The Pipe Operator

library(dplyr)
library(nycflights13)
flights

# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

The Pipe Operator

Difficult to read without pipe:

filter(select(flights, origin, dest), dest == "IAH")

# A tibble: 7,198 × 2
   origin dest 
   <chr>  <chr>
 1 EWR    IAH  
 2 LGA    IAH  
 3 LGA    IAH  
 4 LGA    IAH  
 5 EWR    IAH  
 6 EWR    IAH  
 7 LGA    IAH  
 8 EWR    IAH  
 9 LGA    IAH  
10 EWR    IAH  
# ℹ 7,188 more rows

The Pipe Operator

Easy to read with pipe!

flights |>
  select(origin, dest) |> 
  filter(dest == "IAH")

# A tibble: 7,198 × 2
   origin dest 
   <chr>  <chr>
 1 EWR    IAH  
 2 LGA    IAH  
 3 LGA    IAH  
 4 LGA    IAH  
 5 EWR    IAH  
 6 EWR    IAH  
 7 LGA    IAH  
 8 EWR    IAH  
 9 LGA    IAH  
10 EWR    IAH  
# ℹ 7,188 more rows

dplyr Verb Groups

Verbs for rows: filter(), arrange()
Verbs for columns: mutate(), select()
Verbs for groups: group_by(), summarize()
Verbs for combining tables: left_join(), right_join()

Rows - filter() and arrange()

filter() selects rows based on condition. arrange() changes the order of rows.

Using filter()

# Keeping flights that departed more than 120 minutes late
flights |> 
  filter(dep_delay > 120)

# A tibble: 9,723 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      848           1835       853     1001           1950
 2  2013     1     1      957            733       144     1056            853
 3  2013     1     1     1114            900       134     1447           1222
 4  2013     1     1     1540           1338       122     2020           1825
 5  2013     1     1     1815           1325       290     2120           1542
 6  2013     1     1     1842           1422       260     1958           1535
 7  2013     1     1     1856           1645       131     2212           2005
 8  2013     1     1     1934           1725       129     2126           1855
 9  2013     1     1     1938           1703       155     2109           1823
10  2013     1     1     1942           1705       157     2124           1830
# ℹ 9,713 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Common Mistakes with filter()

Be careful with operators:

Use == for equality, not =.
Combine conditions correctly with & and |.

The arrange() Function

# Sorting by departure time
flights |> 
  arrange(year, month, day, dep_time)

# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

The distinct() Function

# Finding unique rows
flights |> distinct()

# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

The distinct() Function

# Finding unique rows
flights |> 
  select(origin, dest) |> 
  distinct()

# A tibble: 224 × 2
   origin dest 
   <chr>  <chr>
 1 EWR    IAH  
 2 LGA    IAH  
 3 JFK    MIA  
 4 JFK    BQN  
 5 LGA    ATL  
 6 EWR    ORD  
 7 EWR    FLL  
 8 LGA    IAD  
 9 JFK    MCO  
10 LGA    ORD  
# ℹ 214 more rows

Warm-ups

In a single pipeline for each condition, find all flights that meet the condition:

Had an arrival delay of two or more hours
Flew to Houston (IAH or HOU)
Were operated by United, American, or Delta
Departed in summer (July, August, and September)
Arrived more than two hours late, but didn’t leave late
Were delayed by at least an hour, but made up over 30 minutes in flight

Exercises for Section 3.2

Find flights with an arrival delay of two or more hours.
Sort flights to find the fastest flights. (Hint: Try including a math calculation inside of your function.)
Was there a flight on every day of 2013?
Which flights traveled the farthest distance? Which traveled the least distance?
Does it matter what order you used filter() and arrange() if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.

Columns - mutate(), select(), and rename()

mutate(): creates new columns.
select(): selects (retains) columns.
rename(): renames columns.

Using mutate()

# Adding new columns 'gain' and 'speed'
flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60
  )

# A tibble: 336,776 × 21
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 13 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>, gain <dbl>, speed <dbl>

The select() Function

# Selecting specific columns
flights |> select(year, month, day)

# A tibble: 336,776 × 3
    year month   day
   <int> <int> <int>
 1  2013     1     1
 2  2013     1     1
 3  2013     1     1
 4  2013     1     1
 5  2013     1     1
 6  2013     1     1
 7  2013     1     1
 8  2013     1     1
 9  2013     1     1
10  2013     1     1
# ℹ 336,766 more rows

Renaming with rename()

# Renaming 'tailnum' to 'tail_num'
flights |> rename(tail_num = tailnum)

# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tail_num <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Using relocate()

# Moving 'air_time' to the front
flights |> relocate(air_time, .before = year)

# A tibble: 336,776 × 19
   air_time  year month   day dep_time sched_dep_time dep_delay arr_time
      <dbl> <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1      227  2013     1     1      517            515         2      830
 2      227  2013     1     1      533            529         4      850
 3      160  2013     1     1      542            540         2      923
 4      183  2013     1     1      544            545        -1     1004
 5      116  2013     1     1      554            600        -6      812
 6      150  2013     1     1      554            558        -4      740
 7      158  2013     1     1      555            600        -5      913
 8       53  2013     1     1      557            600        -3      709
 9      140  2013     1     1      557            600        -3      838
10      138  2013     1     1      558            600        -2      753
# ℹ 336,766 more rows
# ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Conclusion

You really really really have to practice. Explore dplyr functions with tidy tuesday!

The Pipe Operator

library(dplyr)
library(nycflights13)
# Example of using the pipe operator
flights |>
  filter(dest == "IAH") |>
  group_by(year, month, day) |>
  summarize(arr_delay = mean(arr_delay, na.rm = TRUE))

# A tibble: 365 × 4
# Groups:   year, month [12]
    year month   day arr_delay
   <int> <int> <int>     <dbl>
 1  2013     1     1     17.8 
 2  2013     1     2      7   
 3  2013     1     3     18.3 
 4  2013     1     4     -3.2 
 5  2013     1     5     20.2 
 6  2013     1     6      9.28
 7  2013     1     7     -7.74
 8  2013     1     8      7.79
 9  2013     1     9     18.1 
10  2013     1    10      6.68
# ℹ 355 more rows