The dplyr package acts on data frames, filtering rows and selecting columns as one would in a database. The purrr package acts on lists, including data frames. The cool thing is that the lists may contain rather complicated objects. There is now a purrr cheatsheet. Below is a made-up example to illustrate some features.
library(purrr)
library(dplyr)
library(tidyr)
Let’s create some fake data. Here x is random, and column j of y equals j + j * x + noise, so both the intercept and the slope match the column number.
nc <- 5   # number of columns
nr <- 10  # number of rows
# x: nr-by-nc data frame of standard normal draws
x <- as.data.frame(matrix(rnorm(nr * nc), nr, nc))
# y: column j is j + j * x[, j] plus small noise (sd = 0.001)
y <- t(seq_len(nc) + t(x) * seq_len(nc)) +
  matrix(rnorm(nr * nc, sd = 0.001), nr, nc)
y <- as.data.frame(y)
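The double transpose used to build y relies on R's recycling rules, which can be hard to read at a glance. The following is a minimal sketch that checks the trick against an explicit per-column calculation; x_demo, y_demo and y_check are hypothetical names, and the deterministic toy values are only for illustration.

```r
nc <- 3
nr <- 4
# Deterministic stand-in for x so the result is easy to check by eye.
x_demo <- as.data.frame(matrix(seq_len(nr * nc), nr, nc))

# t(x_demo) is nc-by-nr, so the length-nc vector seq_len(nc) recycles down
# each column and pairs row j (the original column j) with the value j.
y_demo <- t(seq_len(nc) + t(x_demo) * seq_len(nc))

# The same quantity computed column by column: j + j * x[, j].
y_check <- sapply(seq_len(nc), function(j) j + j * x_demo[[j]])

all.equal(unname(y_demo), y_check)
```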
Now we want to regress each column of y on the corresponding column of x and report the coefficients. This could be done in a loop, but let’s first consider what we would do with a single column.
lm_coef <- function(x1, y1) {
  dat <- data.frame(x = x1, y = y1)
  coef(lm(y ~ x, dat))
}
lm_coef(x[,1], y[,1])
## (Intercept) x
## 1.0002675 0.9998238
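One possible way to extend the single-column helper to every pair of columns is purrr's map2(), which walks two lists (here, two data frames) in parallel. This is a sketch rather than the approach the rest of the post takes; x_demo and y_demo are hypothetical stand-ins with an exact linear relation so the coefficients are easy to verify.

```r
library(purrr)

# Same helper as in the text: regress y1 on x1 and return the coefficients.
lm_coef <- function(x1, y1) {
  dat <- data.frame(x = x1, y = y1)
  coef(lm(y ~ x, dat))
}

# Hypothetical stand-in data: column j of y_demo is j + j * x_demo[, j].
x_demo <- data.frame(V1 = c(1, 2, 3, 4), V2 = c(2, 1, 4, 3))
y_demo <- data.frame(V1 = 1 + 1 * x_demo$V1, V2 = 2 + 2 * x_demo$V2)

# map2() calls lm_coef(x_demo[[j]], y_demo[[j]]) for each column j and
# returns a list named after the columns of the first argument.
coefs <- map2(x_demo, y_demo, lm_coef)
coefs
```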
We want to combine x and y into one list. We first make a list of lists, then transpose it.
xy <- list(x = as.list(x), y = as.list(y))
str(xy)
## List of 2
## $ x:List of 5
## ..$ V1: num [1:10] -0.778 -0.102 0.522 -0.857 -0.817 ...
## ..$ V2: num [1:10] -0.3331 0.6234 1.9845 0.4081 0.0759 ...
## ..$ V3: num [1:10] 0.423 -1.476 0.16 -1.288 0.953 ...
## ..$ V4: num [1:10] -0.8662 -1.6173 -0.8681 0.0735 -1.3462 ...
## ..$ V5: num [1:10] -0.191 1.271 -0.294 -0.503 -0.56 ...
## $ y:List of 5
## ..$ V1: num [1:10] 0.222 0.898 1.523 0.144 0.184 ...
## ..$ V2: num [1:10] 1.33 3.25 5.97 2.82 2.15 ...
## ..$ V3: num [1:10] 4.271 -1.428 3.481 -0.865 5.859 ...
## ..$ V4: num [1:10] 0.535 -2.469 0.528 4.292 -1.385 ...
## ..$ V5: num [1:10] 4.05 11.36 3.53 2.49 2.2 ...
xy <- xy %>%
  transpose()
str(xy)
## List of 5
## $ V1:List of 2
## ..$ x: num [1:10] -0.778 -0.102 0.522 -0.857 -0.817 ...
## ..$ y: num [1:10] 0.222 0.898 1.523 0.144 0.184 ...
## $ V2:List of 2
## ..$ x: num [1:10] -0.3331 0.6234 1.9845 0.4081 0.0759 ...
## ..$ y: num [1:10] 1.33 3.25 5.97 2.82 2.15 ...
## $ V3:List of 2
## ..$ x: num [1:10] 0.423 -1.476 0.16 -1.288 0.953 ...
## ..$ y: num [1:10] 4.271 -1.428 3.481 -0.865 5.859 ...
## $ V4:List of 2
## ..$ x: num [1:10] -0.8662 -1.6173 -0.8681 0.0735 -1.3462 ...
## ..$ y: num [1:10] 0.535 -2.469 0.528 4.292 -1.385 ...
## $ V5:List of 2
## ..$ x: num [1:10] -0.191 1.271 -0.294 -0.503 -0.56 ...
## ..$ y: num [1:10] 4.05 11.36 3.53 2.49 2.2 ...
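To see what transpose() does in isolation, here is a minimal sketch on a tiny list of lists. The names outer and flipped are hypothetical. Note that recent purrr releases mark transpose() as superseded by list_transpose(); the sketch sticks to transpose() to match the text.

```r
library(purrr)

# A tiny list of lists whose inner lists share the same names.
outer <- list(
  x = list(V1 = 1:3, V2 = 4:6),
  y = list(V1 = 7:9, V2 = 10:12)
)

# transpose() swaps the two levels of nesting:
# outer$x$V1 becomes flipped$V1$x, and so on.
flipped <- transpose(outer)

names(flipped)                        # "V1" "V2"
names(flipped$V1)                     # "x" "y"
identical(flipped$V2$y, outer$y$V2)   # TRUE
```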
xy %>%
  map(function(dat) coef(lm(y ~ x, dat)))
## $V1
## (Intercept) x
## 1.0002675 0.9998238
##
## $V2
## (Intercept) x
## 2.000001 2.000000
##
## $V3
## (Intercept) x
## 2.999773 3.000250
##
## $V4
## (Intercept) x
## 4.000331 4.000295
##
## $V5
## (Intercept) x
## 4.999892 5.000622
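The anonymous function passed to map() can also be written with purrr's formula shorthand, where .x stands for the current list element. The sketch below compares the two spellings on hypothetical stand-in data (xy_demo); it is meant only to show that they are interchangeable here.

```r
library(purrr)

# Hypothetical stand-in for the transposed list: one (x, y) pair per element.
xy_demo <- list(
  V1 = list(x = c(1, 2, 3, 4), y = c(2, 3, 4, 5)),    # y = 1 + 1 * x
  V2 = list(x = c(2, 1, 4, 3), y = c(6, 4, 10, 8))    # y = 2 + 2 * x
)

# Explicit anonymous function, as in the text.
coef_fn <- map(xy_demo, function(dat) coef(lm(y ~ x, dat)))

# purrr formula shorthand: the outer ~ defines the function, .x is its argument.
coef_lambda <- map(xy_demo, ~ coef(lm(y ~ x, data = .x)))

all.equal(coef_fn, coef_lambda)
```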
This can also be done in two steps: fit the models first, then extract the coefficients. At the end, we organize the results into a matrix with one row per column pair.
xy %>%
  map(function(dat) lm(y ~ x, dat)) %>%
  map(coef) %>%
  as.data.frame %>%
  t
## (Intercept) x
## V1 1.000268 0.9998238
## V2 2.000001 2.0000002
## V3 2.999773 3.0002502
## V4 4.000331 4.0002945
## V5 4.999892 5.0006219
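Keeping the fitted models in their own list makes it easy to pull out different summaries later. The sketch below, on hypothetical stand-in data (xy_demo), shows two alternatives to as.data.frame plus t: binding the coefficient vectors by row with base R, and extracting a single coefficient as a plain numeric vector with map_dbl().

```r
library(purrr)

# Hypothetical stand-in for the transposed list: one (x, y) pair per element.
xy_demo <- list(
  V1 = list(x = c(1, 2, 3, 4), y = c(2, 3, 4, 5)),    # y = 1 + 1 * x
  V2 = list(x = c(2, 1, 4, 3), y = c(6, 4, 10, 8))    # y = 2 + 2 * x
)

# Step 1: keep the full model objects.
fits <- map(xy_demo, function(dat) lm(y ~ x, dat))

# Step 2a: stack the coefficient vectors into a matrix, one row per model.
coef_mat <- do.call(rbind, map(fits, coef))
coef_mat

# Step 2b: map_dbl() returns a named numeric vector instead of a list.
slopes <- map_dbl(fits, function(fit) coef(fit)[["x"]])
slopes
```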
Of course all the above can be done readily with dplyr and tidyr using group_by and do, as shown below. However, two things are useful to consider:
- summarize works for single-value operations; you must use do for multiple-value operations. It is challenging to get do correct, as it must return a data frame (here, one row per group).
- purrr verbs map and transpose do not require the lists to be the same length or configuration. Thus, they can be used in a variety of settings where dplyr, working on data frames, cannot.
xx <- x %>%
  gather(var, xval)
yy <- y %>%
  gather(var, yval)
xx$yval <- yy$yval
xx %>%
  group_by(var) %>%
  do(
    as.data.frame(
      t(
        coef(
          lm(yval ~ xval, .)))))
## # A tibble: 5 x 3
## # Groups: var [5]
## var `(Intercept)` xval
## <chr> <dbl> <dbl>
## 1 V1 1.000268 0.9998238
## 2 V2 2.000001 2.0000002
## 3 V3 2.999773 3.0002502
## 4 V4 4.000331 4.0002945
## 5 V5 4.999892 5.0006219
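The point about lists of unequal length can be illustrated with a small sketch. Here each group has a different number of observations, so the data would not fit side by side as columns of one data frame, yet map_dbl() and map_int() handle it directly. The list ragged and its values are hypothetical.

```r
library(purrr)

# Hypothetical ragged data: group a has 3 points, group b has 5.
ragged <- list(
  a = list(x = c(1, 2, 3),       y = c(3, 5, 7)),              # y = 1 + 2 * x
  b = list(x = c(0, 1, 2, 3, 4), y = c(1, 1.5, 2, 2.5, 3))     # y = 1 + 0.5 * x
)

# Fit one regression per group and keep only the slope.
slopes <- map_dbl(ragged, function(dat) coef(lm(y ~ x, dat))[["x"]])

# Count the observations in each group.
n_obs <- map_int(ragged, function(dat) length(dat$x))

slopes
n_obs
```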