Species using dplyr and purrr

We are going to use the species data from data carpentry to explore with the dplyr, tidyr and purrr packages. The data are available at http://kbroman.org/datacarp/portal_data_joined.csv. Here we attach the packages we will use with the library command.

library(dplyr)
library(tidyr)
library(purrr)
library(readr)

surveys <- read_csv("../data/surveys.csv")

## Parsed with column specification:
## cols(
##   record_id = col_integer(),
##   month = col_integer(),
##   day = col_integer(),
##   year = col_integer(),
##   plot_id = col_integer(),
##   species_id = col_character(),
##   sex = col_character(),
##   hindfoot_length = col_integer(),
##   weight = col_integer()
## )

surveys %>%
  filter(!(is.na(hindfoot_length) & is.na(weight)),
         !is.na(sex)) %>%
  group_by(species_id, sex) %>%
  summarize(hflen = mean(hindfoot_length, na.rm = TRUE),
            wt = mean(weight, na.rm = TRUE))

## # A tibble: 48 x 4
## # Groups:   species_id [?]
##    species_id   sex    hflen         wt
##         <chr> <chr>    <dbl>      <dbl>
##  1         BA     F 13.16129   9.161290
##  2         BA     M 12.64286   7.357143
##  3         DM     F 35.71269  41.609685
##  4         DM     M 36.18823  44.353134
##  5         DO     F 35.48633  48.531250
##  6         DO     M 35.69878  49.135102
##  7         DS     F 49.58317 117.749548
##  8         DS     M 50.30194 122.290984
##  9         NL     F 32.02419 154.282209
## 10         NL     M 32.57743 165.652893
## # ... with 38 more rows

Here is a spread table for one of the traits.

surveys %>%
  filter(!(is.na(hindfoot_length) & is.na(weight)),
         !is.na(sex)) %>%
  group_by(species_id, sex) %>%
  summarize(hflen = mean(hindfoot_length, na.rm = TRUE)) %>%
  spread(sex, hflen)

## # A tibble: 25 x 3
## # Groups:   species_id [25]
##    species_id        F        M
##  *      <chr>    <dbl>    <dbl>
##  1         BA 13.16129 12.64286
##  2         DM 35.71269 36.18823
##  3         DO 35.48633 35.69878
##  4         DS 49.58317 50.30194
##  5         NL 32.02419 32.57743
##  6         OL 20.28767 20.75833
##  7         OT 20.28165 20.25740
##  8         OX 19.66667 18.80000
##  9         PB 25.88267 26.43199
## 10         PE 20.27206 20.13213
## # ... with 15 more rows

How would you do this for two traits? Turns out this is more challenging, and depends on what you want to compare. The simplest thing is to arrange or subset the earlier table by sex. If you want four columns on one line, for sex and the two traits, you will have to be creative. What you want to do is avoid typing the trait names or the sexes if you can. I don’t have an easy solution!

The following gives us a list of 2-trait tables by sex. Notice the use of map from package purrr to act separately on each sex.

tmp <- surveys %>%
  filter(!(is.na(hindfoot_length) & is.na(weight)),
         !is.na(sex))
tmp <- split(tmp, tmp$sex)
tmp2 <- map(tmp, function(x) {
  x %>% 
    group_by(species_id, sex) %>%
    summarize(hflen = mean(hindfoot_length, na.rm = TRUE),
            wt = mean(weight, na.rm = TRUE))
})
tmp2

## $F
## # A tibble: 23 x 4
## # Groups:   species_id [?]
##    species_id   sex    hflen        wt
##         <chr> <chr>    <dbl>     <dbl>
##  1         BA     F 13.16129   9.16129
##  2         DM     F 35.71269  41.60968
##  3         DO     F 35.48633  48.53125
##  4         DS     F 49.58317 117.74955
##  5         NL     F 32.02419 154.28221
##  6         OL     F 20.28767  31.06582
##  7         OT     F 20.28165  24.83090
##  8         OX     F 19.66667  21.00000
##  9         PB     F 25.88267  30.21088
## 10         PE     F 20.27206  22.82218
## # ... with 13 more rows
## 
## $M
## # A tibble: 25 x 4
## # Groups:   species_id [?]
##    species_id   sex    hflen         wt
##         <chr> <chr>    <dbl>      <dbl>
##  1         BA     M 12.64286   7.357143
##  2         DM     M 36.18823  44.353134
##  3         DO     M 35.69878  49.135102
##  4         DS     M 50.30194 122.290984
##  5         NL     M 32.57743 165.652893
##  6         OL     M 20.75833  32.068410
##  7         OT     M 20.25740  23.695382
##  8         OX     M 18.80000  21.000000
##  9         PB     M 26.43199  33.816498
## 10         PE     M 20.13213  20.584906
## # ... with 15 more rows

What about this? It gives small tables by species. The challenge is that species that do not have records for both sexes will mess up any easy combining.

tmp <- surveys %>%
  filter(!(is.na(hindfoot_length) & is.na(weight)),
         !is.na(sex))
tmp <- split(tmp, tmp$species_id)
tmp2 <- map(tmp, function(x) {
  x %>% 
    select(-species_id) %>%
    group_by(sex) %>%
    summarize(hflen = mean(hindfoot_length, na.rm = TRUE),
            wt = mean(weight, na.rm = TRUE))
})
tmp2

## $BA
## # A tibble: 2 x 3
##     sex    hflen       wt
##   <chr>    <dbl>    <dbl>
## 1     F 13.16129 9.161290
## 2     M 12.64286 7.357143
## 
## $DM
## # A tibble: 2 x 3
##     sex    hflen       wt
##   <chr>    <dbl>    <dbl>
## 1     F 35.71269 41.60968
## 2     M 36.18823 44.35313
## 
## $DO
## # A tibble: 2 x 3
##     sex    hflen       wt
##   <chr>    <dbl>    <dbl>
## 1     F 35.48633 48.53125
## 2     M 35.69878 49.13510
## 
## $DS
## # A tibble: 2 x 3
##     sex    hflen       wt
##   <chr>    <dbl>    <dbl>
## 1     F 49.58317 117.7495
## 2     M 50.30194 122.2910
## 
## $NL
## # A tibble: 2 x 3
##     sex    hflen       wt
##   <chr>    <dbl>    <dbl>
## 1     F 32.02419 154.2822
## 2     M 32.57743 165.6529
## 
## $OL
## # A tibble: 2 x 3
##     sex    hflen       wt
##   <chr>    <dbl>    <dbl>
## 1     F 20.28767 31.06582
## 2     M 20.75833 32.06841
## 
## $OT
## # A tibble: 2 x 3
##     sex    hflen       wt
##   <chr>    <dbl>    <dbl>
## 1     F 20.28165 24.83090
## 2     M 20.25740 23.69538
## 
## $OX
## # A tibble: 2 x 3
##     sex    hflen    wt
##   <chr>    <dbl> <dbl>
## 1     F 19.66667    21
## 2     M 18.80000    21
## 
## $PB
## # A tibble: 2 x 3
##     sex    hflen       wt
##   <chr>    <dbl>    <dbl>
## 1     F 25.88267 30.21088
## 2     M 26.43199 33.81650
## 
## $PE
## # A tibble: 2 x 3
##     sex    hflen       wt
##   <chr>    <dbl>    <dbl>
## 1     F 20.27206 22.82218
## 2     M 20.13213 20.58491
## 
## $PF
## # A tibble: 2 x 3
##     sex    hflen       wt
##   <chr>    <dbl>    <dbl>
## 1     F 15.55062 7.974394
## 2     M 15.62078 7.885287
## 
## $PH
## # A tibble: 2 x 3
##     sex    hflen       wt
##   <chr>    <dbl>    <dbl>
## 1     F 25.90000 30.85000
## 2     M 25.54545 31.45455
## 
## $PI
## # A tibble: 1 x 3
##     sex hflen       wt
##   <chr> <dbl>    <dbl>
## 1     M  22.5 19.42857
## 
## $PL
## # A tibble: 2 x 3
##     sex    hflen       wt
##   <chr>    <dbl>    <dbl>
## 1     F 19.75000 19.31250
## 2     M 20.26316 18.68421
## 
## $PM
## # A tibble: 2 x 3
##     sex    hflen       wt
##   <chr>    <dbl>    <dbl>
## 1     F 20.27978 22.12567
## 2     M 20.53002 20.80120
## 
## $PP
## # A tibble: 2 x 3
##     sex    hflen       wt
##   <chr>    <dbl>    <dbl>
## 1     F 21.67662 17.18067
## 2     M 21.83864 17.18449
## 
## $PX
## # A tibble: 2 x 3
##     sex hflen    wt
##   <chr> <dbl> <dbl>
## 1     F    19    20
## 2     M    20    18
## 
## $RF
## # A tibble: 2 x 3
##     sex    hflen       wt
##   <chr>    <dbl>    <dbl>
## 1     F 17.52727 13.68421
## 2     M 17.50000 12.44444
## 
## $RM
## # A tibble: 2 x 3
##     sex    hflen       wt
##   <chr>    <dbl>    <dbl>
## 1     F 16.39188 11.10220
## 2     M 16.49151 10.12323
## 
## $RO
## # A tibble: 2 x 3
##     sex hflen    wt
##   <chr> <dbl> <dbl>
## 1     F 14.75  11.0
## 2     M 16.00   9.5
## 
## $RX
## # A tibble: 1 x 3
##     sex hflen    wt
##   <chr> <dbl> <dbl>
## 1     M  18.5  15.5
## 
## $SF
## # A tibble: 2 x 3
##     sex    hflen       wt
##   <chr>    <dbl>    <dbl>
## 1     F 27.50000 69.00000
## 2     M 26.34783 53.43478
## 
## $SH
## # A tibble: 2 x 3
##     sex    hflen       wt
##   <chr>    <dbl>    <dbl>
## 1     F 29.02817 78.75641
## 2     M 27.98333 65.17742
## 
## $SO
## # A tibble: 2 x 3
##     sex    hflen       wt
##   <chr>    <dbl>    <dbl>
## 1     F 25.63333 55.32258
## 2     M 25.72727 55.70000
## 
## $SS
## # A tibble: 2 x 3
##     sex hflen    wt
##   <chr> <dbl> <dbl>
## 1     F   NaN    57
## 2     M   NaN   130

Bringing in species information

Suppose you want to bring in taxonomic information from another file. While you can use left_join(surveys, species) to create a larger table, what can we do keeping the two tables separate? This can be useful if both tables are large, or if they are configured in different ways. Here

species <- read_csv("../data/species.csv")

## Parsed with column specification:
## cols(
##   species_id = col_character(),
##   genus = col_character(),
##   species = col_character(),
##   taxa = col_character()
## )

Here we want summaries by genus. Notice the use of map_df from purrr to get results as a data frame.

species_taxa <- split(species, species$genus)

map_df(species_taxa,
       function(x, surveys) {
         # Identify species in this taxa
         taxa_sp <- x$species_id
         surveys %>%
           filter(!(is.na(hindfoot_length) & is.na(weight)),
                  !is.na(sex),
           species_id %in% taxa_sp) %>%
           group_by(sex) %>%
           summarize(hflen = mean(hindfoot_length, na.rm = TRUE),
                     wt = mean(weight, na.rm = TRUE))
         },
       surveys,
       .id = "genus")

## # A tibble: 20 x 4
##              genus   sex    hflen         wt
##              <chr> <chr>    <dbl>      <dbl>
##  1         Baiomys     F 13.16129   9.161290
##  2         Baiomys     M 12.64286   7.357143
##  3     Chaetodipus     F 23.82453  23.763824
##  4     Chaetodipus     M 23.92986  24.712219
##  5       Dipodomys     F 37.87090  55.244360
##  6       Dipodomys     M 37.91624  56.243034
##  7         Neotoma     F 32.02419 154.282209
##  8         Neotoma     M 32.57743 165.652893
##  9       Onychomys     F 20.28219  26.780959
## 10       Onychomys     M 20.40312  26.246466
## 11     Perognathus     F 15.82996   8.574803
## 12     Perognathus     M 15.76056   8.204182
## 13      Peromyscus     F 20.26602  22.491649
## 14      Peromyscus     M 20.29880  20.644279
## 15 Reithrodontomys     F 16.43876  11.220080
## 16 Reithrodontomys     M 16.50682  10.159941
## 17        Sigmodon     F 27.94872  71.696000
## 18        Sigmodon     M 27.31915  61.336842
## 19    Spermophilus     F      NaN  57.000000
## 20    Spermophilus     M      NaN 130.000000

Species using dplyr and purrr

Brian S. Yandell

9/27/2017

Bringing in species information