Modern data analysis with R and Rstudio tends to use a growing set of tools developed by Hadley Wickham and the Rstudio crew. This so-called Hadleyverse reimagines many of the basic concepts, nouns and verbs of R. Fundamental to this is dplyr
. This companion to the Aggregating and analyzing data with dplyr lesson looks at some of the concepts from the Data and data frames lesson using dplyr
.
dplyr
package to explore the structure and content of a tibbledata.frame
We are going to use the readr
and dplyr
packages. If you do not have them, please use install.packages
to install.
install.packages(readr,dplyr)
Now attach these packages as libraries for use.
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
We return to the species and weight of animals caught in plots in a study area in Arizona over time. The dataset is stored as a CSV file: each row holds information for a single animal, and the columns represent:
Column | Description |
---|---|
record_id | Unique id for the observation |
month | month of observation |
day | day of observation |
year | year of observation |
plot_id | ID of a particular plot |
species_id | 2-letter code |
sex | sex of animal (“M”, “F”) |
hindfoot_length | length of the hindfoot in mm |
weight | weight of the animal in grams |
genus | genus of animal |
species | species of animal |
taxa | e.g. Rodent, Reptile, Bird, Rabbit |
plot_type | type of plot |
The data are available at http://kbroman.org/datacarp/portal_data_joined.csv.
We can read that data straight from the web. Here we use read_csv
from the readr
package. It is redesigned from read.csv
. Note that read_csv
treats blanks “” and “NA” both as missing, for instance.
surveys <- read_csv("http://kbroman.org/datacarp/portal_data_joined.csv")
## Parsed with column specification:
## cols(
## record_id = col_integer(),
## month = col_integer(),
## day = col_integer(),
## year = col_integer(),
## plot_id = col_integer(),
## species_id = col_character(),
## sex = col_character(),
## hindfoot_length = col_integer(),
## weight = col_integer(),
## genus = col_character(),
## species = col_character(),
## taxa = col_character(),
## plot_type = col_character()
## )
The readr
and dplyr
packages use a reimaging of a data.frame to a tibble
, or augmented table. [There is a tibble
package, but we won’t discuss it.] Tibbles have column names but do not have row names; hence rows and columns are treated somewhat differently, more like for tables in an SQL database, for instance.
Tibbles have many slick features, one being that you don’t need to use head
to see the beginning of the table. All the inspection tools described before still apply, but sometimes there are nicer ways.
surveys
## # A tibble: 34,786 x 13
## record_id month day year plot_id species_id sex hindfoot_length
## <int> <int> <int> <int> <int> <chr> <chr> <int>
## 1 1 7 16 1977 2 NL M 32
## 2 72 8 19 1977 2 NL M 31
## 3 224 9 13 1977 2 NL <NA> NA
## 4 266 10 16 1977 2 NL <NA> NA
## 5 349 11 12 1977 2 NL <NA> NA
## 6 363 11 12 1977 2 NL <NA> NA
## 7 435 12 10 1977 2 NL <NA> NA
## 8 506 1 8 1978 2 NL <NA> NA
## 9 588 2 18 1978 2 NL M NA
## 10 661 3 11 1978 2 NL <NA> NA
## # ... with 34,776 more rows, and 5 more variables: weight <int>,
## # genus <chr>, species <chr>, taxa <chr>, plot_type <chr>
Based on the tibble
of surveys
,
surveys
?sex
and year
?We can pull out parts of a data frame using filter
and select
. We need to provide two values: filter row and select column.
For example, to get the element in the 1st row, 1st column, we can use the verbs filter
and select
, along with row_number
, to return a 1 by 1 tibble:
surveys %>%
filter(row_number() == 1) %>%
select(1)
## # A tibble: 1 x 1
## record_id
## <int>
## 1 1
Notice that we can select
a column by its number (or name – see below), but we filter
rows by logical expressions, in this case the row number must be 1.
To get the element in the 2nd row, 7th column:
surveys %>%
filter(row_number() ==2) %>%
select(7)
## # A tibble: 1 x 1
## sex
## <chr>
## 1 M
To select the entire 2nd row:
surveys %>%
filter(row_number() ==2)
## # A tibble: 1 x 13
## record_id month day year plot_id species_id sex hindfoot_length
## <int> <int> <int> <int> <int> <chr> <chr> <int>
## 1 72 8 19 1977 2 NL M 31
## # ... with 5 more variables: weight <int>, genus <chr>, species <chr>,
## # taxa <chr>, plot_type <chr>
And to filter to the entire 7th column:
surveys %>%
select(7)
## # A tibble: 34,786 x 1
## sex
## <chr>
## 1 M
## 2 M
## 3 <NA>
## 4 <NA>
## 5 <NA>
## 6 <NA>
## 7 <NA>
## 8 <NA>
## 9 M
## 10 <NA>
## # ... with 34,776 more rows
You can also refer to columns by name, in multiple ways. However, the class
of the resulting object may differ
class(surveys %>% select(sex))
## [1] "tbl_df" "tbl" "data.frame"
class(surveys[,"sex"])
## [1] "tbl_df" "tbl" "data.frame"
class(surveys$sex)
## [1] "character"
class(surveys[["sex"]])
## [1] "character"
When we select a single column using dplyr
tools, the result is still a tibble
(or tbl_df
); that is, it is a table with one column. When we select a column as if it were a data.frame
, the result is a character vector.
How would you get the first element of the sex
column as a character if you started with a tibble
representation such as surveys %>% select(sex)
?
To get slices of our tibble, we can include a vector for the row or column indexes (or both). Notice the use of %in%
to filter row numbers in a range.
surveys %>%
filter(row_number() %in% 1:3) %>%
select(7) # first three elements in the 7th column
## # A tibble: 3 x 1
## sex
## <chr>
## 1 M
## 2 M
## 3 <NA>
surveys %>%
filter(row_number() == 1) %>%
select(1:3) # first three columns in the first row
## # A tibble: 1 x 3
## record_id month day
## <int> <int> <int>
## 1 1 7 16
surveys %>%
filter(row_number() %in% 2:4) %>%
select(6:7) # rows 2-4, columns 6-7
## # A tibble: 3 x 2
## species_id sex
## <chr> <chr>
## 1 NL M
## 2 NL <NA>
## 3 NL <NA>
Note again that usually operations on a tibble
result in a tibble
, keeping column headers and table structure.
Use row_number()
, in conjuction with either seq()
or the modulo (mod) operator %%
, to create a new data.frame
called surveys_by_10
that includes every 10th row of the survey data frame starting at row 10 (10, 20, 30, …).
[Hint for mod: what does seq(20) %% 5
yield?]
The dplyr
intro uses tally
several times. If you look on the help page for tally
you will find count
. You will also find reference to n
and sum
. Here are several ways to do the same task. Why are there several ways? Who knows? But sometimes you want to do other things in tandem, and one of these might work best.
surveys %>%
group_by(sex) %>%
tally
## # A tibble: 3 x 2
## sex n
## <chr> <int>
## 1 F 15690
## 2 M 17348
## 3 <NA> 1748
surveys %>%
count(sex)
## # A tibble: 3 x 2
## sex n
## <chr> <int>
## 1 F 15690
## 2 M 17348
## 3 <NA> 1748
surveys %>%
group_by(sex) %>%
summarize(count = n())
## # A tibble: 3 x 2
## sex count
## <chr> <int>
## 1 F 15690
## 2 M 17348
## 3 <NA> 1748
surveys %>%
group_by(sex) %>%
summarize(count = sum(!is.na(year)))
## # A tibble: 3 x 2
## sex count
## <chr> <int>
## 1 F 15690
## 2 M 17348
## 3 <NA> 1748