purrr package

The dplyr package acts on data frames, filtering rows and selecting columns as one would in a database. The purrr package acts on lists, including data frames. The cool thing is that the lists may contain rather complicated objects. There is now a purrr cheatsheet. Below is a made-up example to illustrate some features.

library(purrr)
library(dplyr)
library(tidyr)

Let’s create some fake data. Here x is random and y equals column number + column number × x + noise, so the intercept and slope for column j should both be close to j.

nc <- 5   # number of columns
nr <- 10  # number of rows

# x: nr by nc data frame of standard normal draws
x <- as.data.frame(matrix(rnorm(nr * nc), nr, nc))

# y: column j is j + j * x[, j] plus noise with sd 0.001
y <- t(seq_len(nc) + t(x) * seq_len(nc)) + 
  matrix(rnorm(nr * nc, sd = 0.001), nr, nc)
y <- as.data.frame(y)

Now we want to regress each column of y on the corresponding column of x and report the coefficients. This could be done in a loop, but let’s first consider what we would do with a single pair of columns.

lm_coef <- function(x1, y1) {
  dat <- data.frame(x = x1, y = y1)
  coef(lm(y ~ x, dat))
}

lm_coef(x[,1], y[,1])

## (Intercept)           x 
##   1.0002675   0.9998238

We want to combine x and y into one list, with one element per column pair. We first make a list of lists, then transpose it.

xy <- list(x = as.list(x), y = as.list(y))
str(xy)

## List of 2
##  $ x:List of 5
##   ..$ V1: num [1:10] -0.778 -0.102 0.522 -0.857 -0.817 ...
##   ..$ V2: num [1:10] -0.3331 0.6234 1.9845 0.4081 0.0759 ...
##   ..$ V3: num [1:10] 0.423 -1.476 0.16 -1.288 0.953 ...
##   ..$ V4: num [1:10] -0.8662 -1.6173 -0.8681 0.0735 -1.3462 ...
##   ..$ V5: num [1:10] -0.191 1.271 -0.294 -0.503 -0.56 ...
##  $ y:List of 5
##   ..$ V1: num [1:10] 0.222 0.898 1.523 0.144 0.184 ...
##   ..$ V2: num [1:10] 1.33 3.25 5.97 2.82 2.15 ...
##   ..$ V3: num [1:10] 4.271 -1.428 3.481 -0.865 5.859 ...
##   ..$ V4: num [1:10] 0.535 -2.469 0.528 4.292 -1.385 ...
##   ..$ V5: num [1:10] 4.05 11.36 3.53 2.49 2.2 ...

xy <- xy %>%
  transpose()
str(xy)

## List of 5
##  $ V1:List of 2
##   ..$ x: num [1:10] -0.778 -0.102 0.522 -0.857 -0.817 ...
##   ..$ y: num [1:10] 0.222 0.898 1.523 0.144 0.184 ...
##  $ V2:List of 2
##   ..$ x: num [1:10] -0.3331 0.6234 1.9845 0.4081 0.0759 ...
##   ..$ y: num [1:10] 1.33 3.25 5.97 2.82 2.15 ...
##  $ V3:List of 2
##   ..$ x: num [1:10] 0.423 -1.476 0.16 -1.288 0.953 ...
##   ..$ y: num [1:10] 4.271 -1.428 3.481 -0.865 5.859 ...
##  $ V4:List of 2
##   ..$ x: num [1:10] -0.8662 -1.6173 -0.8681 0.0735 -1.3462 ...
##   ..$ y: num [1:10] 0.535 -2.469 0.528 4.292 -1.385 ...
##  $ V5:List of 2
##   ..$ x: num [1:10] -0.191 1.271 -0.294 -0.503 -0.56 ...
##   ..$ y: num [1:10] 4.05 11.36 3.53 2.49 2.2 ...
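
The effect of transpose is easier to see on a tiny list. The following is a minimal sketch with a made-up input (`pairs`, `flipped`, and `by_hand` are hypothetical names, not part of the example above). It swaps the two levels of nesting and then rebuilds the same pairing by hand with map2. Newer purrr releases document transpose() as superseded by list_transpose(); the sketch sticks with transpose() to match the text.

```r
library(purrr)

# Hypothetical toy input: two named lists with matching inner names.
pairs <- list(x = list(a = 1, b = 2), y = list(a = 10, b = 20))

# transpose() swaps the two levels of nesting:
# the outer names become a, b and the inner names become x, y.
flipped <- transpose(pairs)
str(flipped)

# The same pairing built by hand: walk x and y in parallel
# and bundle each pair into a two-element list.
by_hand <- map2(pairs$x, pairs$y, function(xi, yi) list(x = xi, y = yi))
str(by_hand)
```

After transposing, each element carries everything one model fit needs, which is what lets a single map call do the rest.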

Now we fit the regression to each element of the transposed list with map.

xy %>% 
  map(function(dat) coef(lm(y ~ x, dat)))

## $V1
## (Intercept)           x 
##   1.0002675   0.9998238 
## 
## $V2
## (Intercept)           x 
##    2.000001    2.000000 
## 
## $V3
## (Intercept)           x 
##    2.999773    3.000250 
## 
## $V4
## (Intercept)           x 
##    4.000331    4.000295 
## 
## $V5
## (Intercept)           x 
##    4.999892    5.000622
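
The map call above returns a list. As a hedged sketch, the same fit can be written with purrr's formula shorthand, and a typed variant such as map_dbl can pull out a single coefficient as a named numeric vector. The data here (`xy_demo`) are a hypothetical stand-in with the same shape as the transposed list, and the seed is arbitrary.

```r
library(purrr)

set.seed(2017)  # arbitrary seed, assumed only for reproducibility
nc <- 3
nr <- 10

# Hypothetical stand-in for the transposed list:
# element j has intercept and slope close to j.
xy_demo <- map(seq_len(nc), function(j) {
  xj <- rnorm(nr)
  list(x = xj, y = j + j * xj + rnorm(nr, sd = 0.001))
})
names(xy_demo) <- paste0("V", seq_len(nc))

# Formula shorthand: .x stands for the current list element.
coefs <- map(xy_demo, ~ coef(lm(y ~ x, data = .x)))

# map_dbl() returns a named numeric vector instead of a list,
# and fails if any result is not a single number.
slopes <- map_dbl(xy_demo, ~ coef(lm(y ~ x, data = .x))[["x"]])
slopes
```

The typed variants are a design choice worth considering when the result is known to be one value per element, since they check the type for you.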

This can also be done in two steps: first fit the models, then extract the coefficients. At the end, we organize the results into a matrix with one row per column pair.

xy %>% 
  map(function(dat) lm(y~x, dat)) %>%
  map(coef) %>%
  as.data.frame %>%
  t

##    (Intercept)         x
## V1    1.000268 0.9998238
## V2    2.000001 2.0000002
## V3    2.999773 3.0002502
## V4    4.000331 4.0002945
## V5    4.999892 5.0006219
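
An alternative that skips the transpose step is map2, which walks two lists in parallel. The sketch below reuses the lm_coef function defined earlier and regenerates data of the same form; the seed and the names `coef_list` and `coef_mat` are assumptions added for illustration.

```r
library(purrr)

set.seed(2017)  # arbitrary seed, assumed only for reproducibility
nc <- 5
nr <- 10
x <- as.data.frame(matrix(rnorm(nr * nc), nr, nc))
y <- as.data.frame(
  t(seq_len(nc) + t(x) * seq_len(nc)) +
    matrix(rnorm(nr * nc, sd = 0.001), nr, nc)
)

# Same single-pair helper as in the text.
lm_coef <- function(x1, y1) {
  dat <- data.frame(x = x1, y = y1)
  coef(lm(y ~ x, dat))
}

# map2() passes column j of x and column j of y to lm_coef().
coef_list <- map2(x, y, lm_coef)

# Stack the named vectors into a matrix, one row per column pair.
coef_mat <- do.call(rbind, coef_list)
coef_mat
```

This works because a data frame is a list of columns, so map2 iterates over columns rather than rows.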

Redo with dplyr

Of course all the above can be done readily with dplyr and tidyr using group_by and do, as shown below. However, two things are useful to consider:

While summarize works for single-value operations, you must use do for multiple-value operations. It is challenging to get do correct, as each call must return a data frame (here, one row per group).
The purrr verbs map and transpose do not require the lists to be the same length or configuration. Thus, they can be used in a variety of settings where dplyr, working on data frames, cannot.

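# Stack each data frame into long form: one row per (column, value).
xx <- x %>%
  gather(var, xval)
yy <- y %>%
  gather(var, yval)
# Rows of xx and yy are in the same order, so the values line up.
xx$yval <- yy$yval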

xx %>% 
  group_by(var) %>%
  do(as.data.frame(t(coef(lm(yval ~ xval, .)))))

## # A tibble: 5 x 3
## # Groups:   var [5]
##     var `(Intercept)`      xval
##   <chr>         <dbl>     <dbl>
## 1    V1      1.000268 0.9998238
## 2    V2      2.000001 2.0000002
## 3    V3      2.999773 3.0002502
## 4    V4      4.000331 4.0002945
## 5    V5      4.999892 5.0006219
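
The second point above, that purrr verbs do not need the list elements to match in length or configuration, can be sketched with a made-up ragged list. The names `ragged`, `sizes`, and `means` are hypothetical. Elements of different lengths and types like these do not fit as ordinary columns of one data frame.

```r
library(purrr)

# Hypothetical ragged list: elements differ in length and in type.
ragged <- list(
  short = c(2, 4),
  long = seq_len(10),
  words = c("a", "b", "c")
)

# map_int() applies length() to each element, whatever its size or type.
sizes <- map_int(ragged, length)
sizes

# keep() selects the numeric elements; map_dbl() then averages each one.
means <- map_dbl(keep(ragged, is.numeric), mean)
means
```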


Brian S. Yandell

9/18/2017
