Mutating Joins

Karl (content from chapter 19 of r4ds)

Example

We often have multiple tables, with different types of information, and some variables in common. For example:

library(nycflights13)
weather
# A tibble: 26,115 × 15
   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
 1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4 
 2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
 3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5 
 4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7 
 5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7 
 6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5 
 7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0 
 8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4 
 9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0 
10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8 
# ℹ 26,105 more rows
# ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
#   visib <dbl>, time_hour <dttm>

Another example:

library(nycflights13)
airports
# A tibble: 1,458 × 8
   faa   name                             lat    lon   alt    tz dst   tzone    
   <chr> <chr>                          <dbl>  <dbl> <dbl> <dbl> <chr> <chr>    
 1 04G   Lansdowne Airport               41.1  -80.6  1044    -5 A     America/…
 2 06A   Moton Field Municipal Airport   32.5  -85.7   264    -6 A     America/…
 3 06C   Schaumburg Regional             42.0  -88.1   801    -6 A     America/…
 4 06N   Randall Airport                 41.4  -74.4   523    -5 A     America/…
 5 09J   Jekyll Island Airport           31.1  -81.4    11    -5 A     America/…
 6 0A9   Elizabethton Municipal Airport  36.4  -82.2  1593    -5 A     America/…
 7 0G6   Williams County Airport         41.5  -84.5   730    -5 A     America/…
 8 0G7   Finger Lakes Regional Airport   42.9  -76.8   492    -5 A     America/…
 9 0P2   Shoestring Aviation Airfield    39.8  -76.6  1000    -5 U     America/…
10 0S9   Jefferson County Intl           48.1 -123.    108    -8 A     America/…
# ℹ 1,448 more rows

Yet another example:

planes
# A tibble: 3,322 × 9
   tailnum  year type              manufacturer model engines seats speed engine
   <chr>   <int> <chr>             <chr>        <chr>   <int> <int> <int> <chr> 
 1 N10156   2004 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
 2 N102UW   1998 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 3 N103US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 4 N104UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 5 N10575   2002 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
 6 N105UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 7 N107US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 8 N108UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 9 N109UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
10 N110UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
# ℹ 3,312 more rows

And a final example:

airlines
# A tibble: 16 × 2
   carrier name                       
   <chr>   <chr>                      
 1 9E      Endeavor Air Inc.          
 2 AA      American Airlines Inc.     
 3 AS      Alaska Airlines Inc.       
 4 B6      JetBlue Airways            
 5 DL      Delta Air Lines Inc.       
 6 EV      ExpressJet Airlines Inc.   
 7 F9      Frontier Airlines Inc.     
 8 FL      AirTran Airways Corporation
 9 HA      Hawaiian Airlines Inc.     
10 MQ      Envoy Air                  
11 OO      SkyWest Airlines Inc.      
12 UA      United Air Lines Inc.      
13 US      US Airways Inc.            
14 VX      Virgin America             
15 WN      Southwest Airlines Co.     
16 YV      Mesa Airlines Inc.         

A diagram

Relational diagram

Primary Keys

  • A primary key is a variable or set of variables that uniquely identifies each observation
  • When more than one variable is needed, the key is called a compound key.

Q: What are the primary keys for airports, planes, airlines?

Foreign Keys

  • A foreign key is a variable (or set of variables) that corresponds to a primary key in another table.
  • Typically, we want the foreign key to be the same name as the primary key. It makes it easier to interpret what’s happening.

Q: Find a foreign key that is not identical to it’s primary key. Why might this be good design?

Surrogate Keys

There isn’t a very good primary key for flights. So, you could make one:

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
flights |> mutate(id = row_number())
# A tibble: 336,776 × 20
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>, id <int>

Basic Joins

There are a bunch of different joins:

left_join(), inner_join(), right_join(), full_join(), semi_join(), and anti_join()

You should learn them all! But in the interest of time, we will focus on my favorite, left_join.

left_join

  • left_join is a mutating join because it adds columns, like mutate.
  • it takes two or three arguments.
flights_small = flights |> 
  select(year, time_hour, origin, dest, tailnum, carrier)

# left_join(flights_small, airlines)
# or, with a pipe:
flights_small |>
  left_join(airlines)
Joining with `by = join_by(carrier)`
# A tibble: 336,776 × 7
    year time_hour           origin dest  tailnum carrier name                  
   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>                 
 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      United Air Lines Inc. 
 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      United Air Lines Inc. 
 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      American Airlines Inc.
 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      JetBlue Airways       
 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      Delta Air Lines Inc.  
 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      United Air Lines Inc. 
 7  2013 2013-01-01 06:00:00 EWR    FLL   N516JB  B6      JetBlue Airways       
 8  2013 2013-01-01 06:00:00 LGA    IAD   N829AS  EV      ExpressJet Airlines I…
 9  2013 2013-01-01 06:00:00 JFK    MCO   N593JB  B6      JetBlue Airways       
10  2013 2013-01-01 06:00:00 LGA    ORD   N3ALAA  AA      American Airlines Inc.
# ℹ 336,766 more rows

More examples:

weather_small = weather |> select(origin, time_hour, temp, wind_speed)
flights_small |> 
  left_join(weather_small)
Joining with `by = join_by(time_hour, origin)`
# A tibble: 336,776 × 8
    year time_hour           origin dest  tailnum carrier  temp wind_speed
   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <dbl>      <dbl>
 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA       39.0       12.7
 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA       39.9       15.0
 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA       39.0       15.0
 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6       39.0       15.0
 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL       39.9       16.1
 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA       39.0       12.7
 7  2013 2013-01-01 06:00:00 EWR    FLL   N516JB  B6       37.9       11.5
 8  2013 2013-01-01 06:00:00 LGA    IAD   N829AS  EV       39.9       16.1
 9  2013 2013-01-01 06:00:00 JFK    MCO   N593JB  B6       37.9       13.8
10  2013 2013-01-01 06:00:00 LGA    ORD   N3ALAA  AA       39.9       16.1
# ℹ 336,766 more rows

More examples:

planes_small = planes |> select(tailnum, type, engines, seats)
flights_small |> 
  left_join(planes_small)
Joining with `by = join_by(tailnum)`
# A tibble: 336,776 × 9
    year time_hour           origin dest  tailnum carrier type     engines seats
   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>      <int> <int>
 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      Fixed w…       2   149
 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      Fixed w…       2   149
 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      Fixed w…       2   178
 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      Fixed w…       2   200
 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      Fixed w…       2   178
 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      Fixed w…       2   191
 7  2013 2013-01-01 06:00:00 EWR    FLL   N516JB  B6      Fixed w…       2   200
 8  2013 2013-01-01 06:00:00 LGA    IAD   N829AS  EV      Fixed w…       2    55
 9  2013 2013-01-01 06:00:00 JFK    MCO   N593JB  B6      Fixed w…       2   200
10  2013 2013-01-01 06:00:00 LGA    ORD   N3ALAA  AA      <NA>          NA    NA
# ℹ 336,766 more rows

What if there is no match?

flights_small |> 
  filter(tailnum == "N3ALAA") |> 
  left_join(planes_small)
Joining with `by = join_by(tailnum)`
# A tibble: 63 × 9
    year time_hour           origin dest  tailnum carrier type  engines seats
   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>   <int> <int>
 1  2013 2013-01-01 06:00:00 LGA    ORD   N3ALAA  AA      <NA>       NA    NA
 2  2013 2013-01-02 18:00:00 LGA    ORD   N3ALAA  AA      <NA>       NA    NA
 3  2013 2013-01-03 06:00:00 LGA    ORD   N3ALAA  AA      <NA>       NA    NA
 4  2013 2013-01-07 19:00:00 LGA    ORD   N3ALAA  AA      <NA>       NA    NA
 5  2013 2013-01-08 17:00:00 JFK    ORD   N3ALAA  AA      <NA>       NA    NA
 6  2013 2013-01-16 06:00:00 LGA    ORD   N3ALAA  AA      <NA>       NA    NA
 7  2013 2013-01-20 18:00:00 LGA    ORD   N3ALAA  AA      <NA>       NA    NA
 8  2013 2013-01-22 17:00:00 JFK    ORD   N3ALAA  AA      <NA>       NA    NA
 9  2013 2013-10-11 06:00:00 EWR    MIA   N3ALAA  AA      <NA>       NA    NA
10  2013 2013-10-14 08:00:00 JFK    BOS   N3ALAA  AA      <NA>       NA    NA
# ℹ 53 more rows

What if there are multiple matches??

Put this in Rstudio to see what happens… you get a helpful error!

flights_no_good = flights |> select(year, month, day, hour, dep_delay)
weather_no_good = weather |> select(year, month, day, hour, precip, origin)
flights_no_good |> 
  left_join(weather_no_good)
Joining with `by = join_by(year, month, day, hour)`
Warning in left_join(flights_no_good, weather_no_good): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 5 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
# A tibble: 1,006,987 × 7
    year month   day  hour dep_delay precip origin
   <int> <int> <int> <dbl>     <dbl>  <dbl> <chr> 
 1  2013     1     1     5         2      0 EWR   
 2  2013     1     1     5         2      0 JFK   
 3  2013     1     1     5         2      0 LGA   
 4  2013     1     1     5         4      0 EWR   
 5  2013     1     1     5         4      0 JFK   
 6  2013     1     1     5         4      0 LGA   
 7  2013     1     1     5         2      0 EWR   
 8  2013     1     1     5         2      0 JFK   
 9  2013     1     1     5         2      0 LGA   
10  2013     1     1     5        -1      0 EWR   
# ℹ 1,006,977 more rows

left_join(left_tibble, right_tibble, by = …)

left_join helps us by choosing the variables to “join by”. Sometimes, it gets it wrong!

flights_small |> 
  left_join(planes)
Joining with `by = join_by(year, tailnum)`
# A tibble: 336,776 × 13
    year time_hour           origin dest  tailnum carrier type  manufacturer
   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr> <chr>       
 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      <NA>  <NA>        
 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      <NA>  <NA>        
 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      <NA>  <NA>        
 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      <NA>  <NA>        
 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      <NA>  <NA>        
 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      <NA>  <NA>        
 7  2013 2013-01-01 06:00:00 EWR    FLL   N516JB  B6      <NA>  <NA>        
 8  2013 2013-01-01 06:00:00 LGA    IAD   N829AS  EV      <NA>  <NA>        
 9  2013 2013-01-01 06:00:00 JFK    MCO   N593JB  B6      <NA>  <NA>        
10  2013 2013-01-01 06:00:00 LGA    ORD   N3ALAA  AA      <NA>  <NA>        
# ℹ 336,766 more rows
# ℹ 5 more variables: model <chr>, engines <int>, seats <int>, speed <int>,
#   engine <chr>

Why is this wrong?

When the default “by” doesn’t work…

You might want to specify the by argument…

flights_small |> 
  left_join(planes, join_by(tailnum))
# A tibble: 336,776 × 14
   year.x time_hour           origin dest  tailnum carrier year.y type          
    <int> <dttm>              <chr>  <chr> <chr>   <chr>    <int> <chr>         
 1   2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA        1999 Fixed wing mu…
 2   2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA        1998 Fixed wing mu…
 3   2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA        1990 Fixed wing mu…
 4   2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6        2012 Fixed wing mu…
 5   2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL        1991 Fixed wing mu…
 6   2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA        2012 Fixed wing mu…
 7   2013 2013-01-01 06:00:00 EWR    FLL   N516JB  B6        2000 Fixed wing mu…
 8   2013 2013-01-01 06:00:00 LGA    IAD   N829AS  EV        1998 Fixed wing mu…
 9   2013 2013-01-01 06:00:00 JFK    MCO   N593JB  B6        2004 Fixed wing mu…
10   2013 2013-01-01 06:00:00 LGA    ORD   N3ALAA  AA          NA <NA>          
# ℹ 336,766 more rows
# ℹ 6 more variables: manufacturer <chr>, model <chr>, engines <int>,
#   seats <int>, speed <int>, engine <chr>

What if the keys don’t have the same name?

flights_small |> 
  left_join(airports, join_by(dest == faa))
# A tibble: 336,776 × 13
    year time_hour           origin dest  tailnum carrier name         lat   lon
   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>      <dbl> <dbl>
 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      George Bu…  30.0 -95.3
 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      George Bu…  30.0 -95.3
 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      Miami Intl  25.8 -80.3
 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      <NA>        NA    NA  
 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      Hartsfiel…  33.6 -84.4
 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      Chicago O…  42.0 -87.9
 7  2013 2013-01-01 06:00:00 EWR    FLL   N516JB  B6      Fort Laud…  26.1 -80.2
 8  2013 2013-01-01 06:00:00 LGA    IAD   N829AS  EV      Washingto…  38.9 -77.5
 9  2013 2013-01-01 06:00:00 JFK    MCO   N593JB  B6      Orlando I…  28.4 -81.3
10  2013 2013-01-01 06:00:00 LGA    ORD   N3ALAA  AA      Chicago O…  42.0 -87.9
# ℹ 336,766 more rows
# ℹ 4 more variables: alt <dbl>, tz <dbl>, dst <chr>, tzone <chr>

multiple variables?

Just add commas…

flights |> 
  left_join(weather, join_by(year, month, day, hour, origin))
# A tibble: 336,776 × 29
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 21 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour.x <dttm>, temp <dbl>, dewp <dbl>,
#   humid <dbl>, wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
#   precip <dbl>, pressure <dbl>, visib <dbl>, time_hour.y <dttm>

Old code…

In older code you might see a different way of specifying the join keys, using a character vector:

  • by = "x" corresponds to join_by(x).
  • by = c("a" = "x") corresponds to join_by(a == x)