group_by and summarize

Karl Rohe (taking from r4ds)

We will use these packages:

library(dplyr)
library(nycflights13)

summarize

What is this output?

flights |> 
  summarize(
    avg_delay = mean(dep_delay, na.rm=TRUE)
  )
# A tibble: 1 × 1
  avg_delay
      <dbl>
1      12.6

summarize

How does the output of summarize change with a group_by?

flights |> 
  summarize(
    avg_delay = mean(dep_delay, na.rm=TRUE)
  )
# A tibble: 1 × 1
  avg_delay
      <dbl>
1      12.6
flights |> 
  group_by(month) |> 
  summarize(
    avg_delay = mean(dep_delay, na.rm=TRUE)
  )
# A tibble: 12 × 2
   month avg_delay
   <int>     <dbl>
 1     1     10.0 
 2     2     10.8 
 3     3     13.2 
 4     4     13.9 
 5     5     13.0 
 6     6     20.8 
 7     7     21.7 
 8     8     12.6 
 9     9      6.72
10    10      6.24
11    11      5.44
12    12     16.6 

group_by

Use group_by() to divide your dataset into groups meaningful for your analysis:

flights |> 
  group_by(month)
# A tibble: 336,776 × 19
# Groups:   month [12]
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Notice the 12 groups… While it prints out as one tibble, imagine it as 12 different tibbles. So, when we summarize, we get a summary for each of these 12 tibbles.

You can summarize with more than one “Statistic”

flights |> 
  group_by(month) |> 
  summarize(
    avg_delay = mean(dep_delay, na.rm = TRUE), 
    n = n()
  )
# A tibble: 12 × 3
   month avg_delay     n
   <int>     <dbl> <int>
 1     1     10.0  27004
 2     2     10.8  24951
 3     3     13.2  28834
 4     4     13.9  28330
 5     5     13.0  28796
 6     6     20.8  28243
 7     7     21.7  29425
 8     8     12.6  29327
 9     9      6.72 27574
10    10      6.24 28889
11    11      5.44 27268
12    12     16.6  28135

You can also group_by more than one variable…

flights |> 
  group_by(month, day) 
# A tibble: 336,776 × 19
# Groups:   month, day [365]
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

You can also group_by more than one variable…

flights |> 
  group_by(month, day) |> 
  summarize(
    avg_delay = mean(dep_delay, na.rm = TRUE), 
    n = n()
  )
# A tibble: 365 × 4
# Groups:   month [12]
   month   day avg_delay     n
   <int> <int>     <dbl> <int>
 1     1     1     11.5    842
 2     1     2     13.9    943
 3     1     3     11.0    914
 4     1     4      8.95   915
 5     1     5      5.73   720
 6     1     6      7.15   832
 7     1     7      5.42   933
 8     1     8      2.55   899
 9     1     9      2.28   902
10     1    10      2.84   932
# ℹ 355 more rows

The slice_ functions

Extract specific rows within each group:

  • df |> slice_head(n = 1) takes the first row from each group.
  • df |> slice_tail(n = 1) takes the last row in each group.
  • df |> slice_min(x, n = 1) takes the row with the smallest value of column x.
  • df |> slice_max(x, n = 1) takes the row with the largest value of column x.
  • df |> slice_sample(n = 1) takes one random row.

These are particularly nice within group_by… you get rows from each group.

Example

flights |> 
  group_by(dest) |> 
  slice_max(arr_delay, n = 1) |>
  relocate(dest)
# A tibble: 108 × 19
# Groups:   dest [105]
   dest   year month   day dep_time sched_dep_time dep_delay arr_time
   <chr> <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1 ABQ    2013     7    22     2145           2007        98      132
 2 ACK    2013     7    23     1139            800       219     1250
 3 ALB    2013     1    25      123           2000       323      229
 4 ANC    2013     8    17     1740           1625        75     2042
 5 ATL    2013     7    22     2257            759       898      121
 6 AUS    2013     7    10     2056           1505       351     2347
 7 AVL    2013     8    13     1156            832       204     1417
 8 BDL    2013     2    21     1728           1316       252     1839
 9 BGR    2013    12     1     1504           1056       248     1628
10 BHM    2013     4    10       25           1900       325      136
# ℹ 98 more rows
# ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

multiple groups and ungroup

If you group by multiple variables and then summarize, the resulting tibble will still have groups! It will peel off the last variable listed in the group_by:

daily <- flights |>  
  group_by(year, month, day)
daily
# A tibble: 336,776 × 19
# Groups:   year, month, day [365]
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
daily |> 
  summarize(n = n())
# A tibble: 365 × 4
# Groups:   year, month [12]
    year month   day     n
   <int> <int> <int> <int>
 1  2013     1     1   842
 2  2013     1     2   943
 3  2013     1     3   914
 4  2013     1     4   915
 5  2013     1     5   720
 6  2013     1     6   832
 7  2013     1     7   933
 8  2013     1     8   899
 9  2013     1     9   902
10  2013     1    10   932
# ℹ 355 more rows

I usually find this part annoying…

I ungroup right away. It is easy and fast and clear to simply make another grouping if you want another one.

daily |> ungroup()
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Guess what this does…

In words, what should the output of this be?

daily |> 
  ungroup() |>
  summarize(
    avg_delay = mean(dep_delay, na.rm = TRUE), 
    flights = n()
  )

Guess what this does…

In words, what should the output of this be?

daily |> 
  ungroup() |>
  summarize(
    avg_delay = mean(dep_delay, na.rm = TRUE), 
    flights = n()
  )
# A tibble: 1 × 2
  avg_delay flights
      <dbl>   <int>
1      12.6  336776

Class discussion:

  • Which origin airport has the worst average delays?

  • Challenge: can you disentangle the effects of bad destination airports vs. bad origin? Why/why not?

  • What if we replaced “origin airport” with “carrier”?

  • What role might weather play? Is this a confounding variable? Discuss weather at origins (JFK, LGA, EWR) and weather at destination airports.

First step…