Modern data analysis with R and Rstudio tends to use a growing set of tools developed by Hadley Wickham and the Rstudio crew. This so-called Hadleyverse reimagines many of the basic concepts, nouns and verbs of R. Fundamental to this is dplyr. This companion to the Aggregating and analyzing data with dplyr lesson looks at some of the concepts from the Data and data frames lesson using dplyr.

Learning Objectives

Setup

We are going to use the readr and dplyr packages. If you do not have them, please use install.packages to install.

install.packages(readr,dplyr)

Now attach these packages as libraries for use.

library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Survey Data

We return to the species and weight of animals caught in plots in a study area in Arizona over time. The dataset is stored as a CSV file: each row holds information for a single animal, and the columns represent:

Column Description
record_id Unique id for the observation
month month of observation
day day of observation
year year of observation
plot_id ID of a particular plot
species_id 2-letter code
sex sex of animal (“M”, “F”)
hindfoot_length length of the hindfoot in mm
weight weight of the animal in grams
genus genus of animal
species species of animal
taxa e.g. Rodent, Reptile, Bird, Rabbit
plot_type type of plot

The data are available at http://kbroman.org/datacarp/portal_data_joined.csv.

We can read that data straight from the web. Here we use read_csv from the readr package. It is redesigned from read.csv. Note that read_csv treats blanks “” and “NA” both as missing, for instance.

surveys <- read_csv("http://kbroman.org/datacarp/portal_data_joined.csv")
## Parsed with column specification:
## cols(
##   record_id = col_integer(),
##   month = col_integer(),
##   day = col_integer(),
##   year = col_integer(),
##   plot_id = col_integer(),
##   species_id = col_character(),
##   sex = col_character(),
##   hindfoot_length = col_integer(),
##   weight = col_integer(),
##   genus = col_character(),
##   species = col_character(),
##   taxa = col_character(),
##   plot_type = col_character()
## )

Data frames and Tibbles

The readr and dplyr packages use a reimaging of a data.frame to a tibble, or augmented table. [There is a tibble package, but we won’t discuss it.] Tibbles have column names but do not have row names; hence rows and columns are treated somewhat differently, more like for tables in an SQL database, for instance.

Tibbles have many slick features, one being that you don’t need to use head to see the beginning of the table. All the inspection tools described before still apply, but sometimes there are nicer ways.

surveys
## # A tibble: 34,786 x 13
##    record_id month   day  year plot_id species_id   sex hindfoot_length
##        <int> <int> <int> <int>   <int>      <chr> <chr>           <int>
##  1         1     7    16  1977       2         NL     M              32
##  2        72     8    19  1977       2         NL     M              31
##  3       224     9    13  1977       2         NL  <NA>              NA
##  4       266    10    16  1977       2         NL  <NA>              NA
##  5       349    11    12  1977       2         NL  <NA>              NA
##  6       363    11    12  1977       2         NL  <NA>              NA
##  7       435    12    10  1977       2         NL  <NA>              NA
##  8       506     1     8  1978       2         NL  <NA>              NA
##  9       588     2    18  1978       2         NL     M              NA
## 10       661     3    11  1978       2         NL  <NA>              NA
## # ... with 34,776 more rows, and 5 more variables: weight <int>,
## #   genus <chr>, species <chr>, taxa <chr>, plot_type <chr>

Challenge

Based on the tibble of surveys,

  • How many rows and columns are in surveys?
  • What are the classes of the columns for sex and year?

Indexing, Sequences, and Subsetting

We can pull out parts of a data frame using filter and select. We need to provide two values: filter row and select column.

For example, to get the element in the 1st row, 1st column, we can use the verbs filter and select, along with row_number, to return a 1 by 1 tibble:

surveys %>%
  filter(row_number() == 1) %>%
  select(1)
## # A tibble: 1 x 1
##   record_id
##       <int>
## 1         1

Notice that we can select a column by its number (or name – see below), but we filter rows by logical expressions, in this case the row number must be 1.

To get the element in the 2nd row, 7th column:

surveys %>%
  filter(row_number() ==2) %>%
  select(7)
## # A tibble: 1 x 1
##     sex
##   <chr>
## 1     M

To select the entire 2nd row:

surveys %>%
  filter(row_number() ==2)
## # A tibble: 1 x 13
##   record_id month   day  year plot_id species_id   sex hindfoot_length
##       <int> <int> <int> <int>   <int>      <chr> <chr>           <int>
## 1        72     8    19  1977       2         NL     M              31
## # ... with 5 more variables: weight <int>, genus <chr>, species <chr>,
## #   taxa <chr>, plot_type <chr>

And to filter to the entire 7th column:

surveys %>%
  select(7)
## # A tibble: 34,786 x 1
##      sex
##    <chr>
##  1     M
##  2     M
##  3  <NA>
##  4  <NA>
##  5  <NA>
##  6  <NA>
##  7  <NA>
##  8  <NA>
##  9     M
## 10  <NA>
## # ... with 34,776 more rows

You can also refer to columns by name, in multiple ways. However, the class of the resulting object may differ

class(surveys %>% select(sex))
## [1] "tbl_df"     "tbl"        "data.frame"
class(surveys[,"sex"])
## [1] "tbl_df"     "tbl"        "data.frame"
class(surveys$sex)
## [1] "character"
class(surveys[["sex"]])
## [1] "character"

When we select a single column using dplyr tools, the result is still a tibble (or tbl_df); that is, it is a table with one column. When we select a column as if it were a data.frame, the result is a character vector.

Challenge

How would you get the first element of the sex column as a character if you started with a tibble representation such as surveys %>% select(sex)?

Slices

To get slices of our tibble, we can include a vector for the row or column indexes (or both). Notice the use of %in% to filter row numbers in a range.

surveys %>%
  filter(row_number() %in% 1:3) %>%
  select(7)  # first three elements in the 7th column
## # A tibble: 3 x 1
##     sex
##   <chr>
## 1     M
## 2     M
## 3  <NA>
surveys %>%
  filter(row_number() == 1) %>%
  select(1:3)   # first three columns in the first row
## # A tibble: 1 x 3
##   record_id month   day
##       <int> <int> <int>
## 1         1     7    16
surveys %>%
  filter(row_number() %in% 2:4) %>%
  select(6:7) # rows 2-4, columns 6-7
## # A tibble: 3 x 2
##   species_id   sex
##        <chr> <chr>
## 1         NL     M
## 2         NL  <NA>
## 3         NL  <NA>

Note again that usually operations on a tibble result in a tibble, keeping column headers and table structure.

Challenge

Use row_number(), in conjuction with either seq() or the modulo (mod) operator %%, to create a new data.frame called surveys_by_10 that includes every 10th row of the survey data frame starting at row 10 (10, 20, 30, …).

[Hint for mod: what does seq(20) %% 5 yield?]

More tally stuff

The dplyr intro uses tally several times. If you look on the help page for tally you will find count. You will also find reference to n and sum. Here are several ways to do the same task. Why are there several ways? Who knows? But sometimes you want to do other things in tandem, and one of these might work best.

surveys %>%
  group_by(sex) %>%
  tally
## # A tibble: 3 x 2
##     sex     n
##   <chr> <int>
## 1     F 15690
## 2     M 17348
## 3  <NA>  1748
surveys %>%
  count(sex)
## # A tibble: 3 x 2
##     sex     n
##   <chr> <int>
## 1     F 15690
## 2     M 17348
## 3  <NA>  1748
surveys %>%
  group_by(sex) %>%
  summarize(count = n())
## # A tibble: 3 x 2
##     sex count
##   <chr> <int>
## 1     F 15690
## 2     M 17348
## 3  <NA>  1748
surveys %>%
  group_by(sex) %>%
  summarize(count = sum(!is.na(year)))
## # A tibble: 3 x 2
##     sex count
##   <chr> <int>
## 1     F 15690
## 2     M 17348
## 3  <NA>  1748