surveys <- read.csv("http://kbroman.org/datacarp/portal_data_joined.csv",
stringsAsFactors = FALSE)
library(dplyr) ## load the package
surveys %>%
filter(taxa == "Rodent",
!is.na(weight)) %>%
group_by(sex,genus) %>%
summarize(mean_weight = mean(weight))
## # A tibble: 28 x 3
## # Groups: sex [?]
## sex genus mean_weight
## <chr> <chr> <dbl>
## 1 Chaetodipus 19.81250
## 2 Dipodomys 81.37143
## 3 Neotoma 167.68750
## 4 Onychomys 23.42857
## 5 Perognathus 6.00000
## 6 Peromyscus 19.87500
## 7 Reithrodontomys 11.08333
## 8 Sigmodon 70.33333
## 9 F Baiomys 9.16129
## 10 F Chaetodipus 23.76382
## # ... with 18 more rows
library(tidyr)
Here we use spread()
from the tidyr
package. This gives a more compact, if wider, table summary.
surveys %>%
filter(taxa == "Rodent",
!is.na(weight)) %>%
group_by(sex,genus) %>%
summarize(mean_weight = mean(weight)) %>%
spread(sex, mean_weight)
## # A tibble: 10 x 4
## genus `` F M
## * <chr> <dbl> <dbl> <dbl>
## 1 Baiomys NA 9.161290 7.357143
## 2 Chaetodipus 19.81250 23.763824 24.712219
## 3 Dipodomys 81.37143 55.244360 56.243034
## 4 Neotoma 167.68750 154.282209 165.652893
## 5 Onychomys 23.42857 26.780959 26.246466
## 6 Perognathus 6.00000 8.574803 8.204182
## 7 Peromyscus 19.87500 22.491649 20.644279
## 8 Reithrodontomys 11.08333 11.220080 10.159941
## 9 Sigmodon 70.33333 71.696000 61.336842
## 10 Spermophilus NA 57.000000 130.000000
Here is a simpler tidy example to just count how many records with weigth by genus and sex.
surveys %>%
filter(taxa == "Rodent",
!is.na(weight)) %>%
count(sex, genus) %>%
spread(sex, n)
## # A tibble: 10 x 4
## genus `` F M
## * <chr> <int> <int> <int>
## 1 Baiomys NA 31 14
## 2 Chaetodipus 16 3201 2627
## 3 Dipodomys 35 6826 8649
## 4 Neotoma 16 652 484
## 5 Onychomys 7 1502 1627
## 6 Perognathus 4 762 813
## 7 Peromyscus 8 958 1206
## 8 Reithrodontomys 12 1245 1363
## 9 Sigmodon 3 125 95
## 10 Spermophilus NA 1 1
The purrr
package works on lists (which includes data frames), giving great generality to what objects might be in the lists. Below we show two examples.
library(purrr)
In the first example, for each genus, we fit a linear model with lm()
and extract the "r.squared"
element from the summary()
of the fit. Note the use of split()
to split the data frame into a list of data frames, one per genus. The map()
function from purrr
returns a list, while the map_dbl()
function returns a vector.
surveys %>%
filter(taxa == "Rodent",
!is.na(weight)) %>%
select(genus,weight,sex) %>%
split(.$genus) %>%
map(~ lm(weight ~ sex, data=.)) %>%
map(summary) %>%
map_dbl("r.squared")
## Baiomys Chaetodipus Dipodomys Neotoma
## 0.164529171 0.003305788 0.001992282 0.016659203
## Onychomys Perognathus Peromyscus Reithrodontomys
## 0.002306456 0.003560761 0.042744223 0.054463691
## Sigmodon Spermophilus
## 0.038990704 1.000000000
Note that for Spermophilus
, the \(R^2\) of 1 reflects that there are only two records, one male and one female (see table above in tidyr
section).
There were three calls to purrr
functions. The first map(~ lm())
call creates a list of "lm"
objects; the second map(summary)
call creates a list of "summary.lm"
objects; the third map_dbl()
creates a vector of double-precision values.
The following more complicated example takes the coef()
of the lm()
fits to get the estimates of coefficients. These coefficients have to be converted into a data frame, and then the data frames are bound together using bind_rows()
.
surveys %>%
filter(taxa == "Rodent",
!is.na(weight)) %>%
select(genus,weight,sex) %>%
split(list(.$genus)) %>%
map(~ lm(weight ~ sex, data=.)) %>%
map(coef) %>%
map(function(x) data.frame(level = names(x), estimate = x,
stringsAsFactors = FALSE)) %>%
bind_rows(.id = "genus")
## genus level estimate
## 1 Baiomys (Intercept) 9.1612903
## 2 Baiomys sexM -1.8041475
## 3 Chaetodipus (Intercept) 19.8125000
## 4 Chaetodipus sexF 3.9513238
## 5 Chaetodipus sexM 4.8997193
## 6 Dipodomys (Intercept) 81.3714286
## 7 Dipodomys sexF -26.1270688
## 8 Dipodomys sexM -25.1283947
## 9 Neotoma (Intercept) 167.6875000
## 10 Neotoma sexF -13.4052914
## 11 Neotoma sexM -2.0346074
## 12 Onychomys (Intercept) 23.4285714
## 13 Onychomys sexF 3.3523873
## 14 Onychomys sexM 2.8178945
## 15 Perognathus (Intercept) 6.0000000
## 16 Perognathus sexF 2.5748031
## 17 Perognathus sexM 2.2041820
## 18 Peromyscus (Intercept) 19.8750000
## 19 Peromyscus sexF 2.6166493
## 20 Peromyscus sexM 0.7692786
## 21 Reithrodontomys (Intercept) 11.0833333
## 22 Reithrodontomys sexF 0.1367470
## 23 Reithrodontomys sexM -0.9233920
## 24 Sigmodon (Intercept) 70.3333333
## 25 Sigmodon sexF 1.3626667
## 26 Sigmodon sexM -8.9964912
## 27 Spermophilus (Intercept) 57.0000000
## 28 Spermophilus sexM 73.0000000
The following is the same idea as above, but using summary()
rather than coef()
. We also spread()
the data for a more compact table.
surveys %>%
filter(taxa == "Rodent",
!is.na(weight)) %>%
select(genus,weight,sex) %>%
split(list(.$genus)) %>%
map(~ lm(weight ~ sex, data=.)) %>%
map(summary) %>%
map(function(x) {
out <- as.data.frame(x$coefficients[,1, drop = FALSE])
out$level <- row.names(out)
out[, 2:1]
}) %>%
bind_rows(.id = "genus") %>%
spread(level, Estimate)
## genus (Intercept) sexF sexM
## 1 Baiomys 9.16129 NA -1.8041475
## 2 Chaetodipus 19.81250 3.951324 4.8997193
## 3 Dipodomys 81.37143 -26.127069 -25.1283947
## 4 Neotoma 167.68750 -13.405291 -2.0346074
## 5 Onychomys 23.42857 3.352387 2.8178945
## 6 Perognathus 6.00000 2.574803 2.2041820
## 7 Peromyscus 19.87500 2.616649 0.7692786
## 8 Reithrodontomys 11.08333 0.136747 -0.9233920
## 9 Sigmodon 70.33333 1.362667 -8.9964912
## 10 Spermophilus 57.00000 NA 73.0000000