Working with Data in R

Data Structures in R

There are many good introductions to data structures in R. Basically, R can act as a calculator, with built-in functions such as log(10) and sqrt(4), but it can do much more. See examples in the links below.

The basic structure in R is a vector. Here is a vector of the numbers 1 through 5:

1:5

[1] 1 2 3 4 5

Challenge

Go through the Data Carpentry or Jenny Bryan introduction to learn more about data basics. Read through Doug Bates’s notes to fill in further gaps.

Side story on data wrangling

Jenny Bryan expounds on data science. GapMinder was an early innovation in data exploration. The GapMinder data is used throughout Jenny Bryan’s course, and can now be found as the gapminder R package. Hans Rosling, originator of GapMinder, has passed, but his tools and ideas persist.

Data Wrangling

It is often more useful to think of R as mostly operating on rectangular tables, which have many of the characteristics of spreadsheets. Tables, known in R as data frames, have observations (individuals or cases) in rows and variables (fields) in columns. The variables might be of different types, say numeric or character or logical. We might want to operate on the whole table, or on a row or column, or on some rectangular subset of the table.
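As a minimal sketch of these ideas, here is a small data frame built by hand. The column names and values are illustrative assumptions, not data from any of the linked sources.

```r
# A small illustrative table: rows are observations, columns are variables.
df <- data.frame(
  id     = 1:4,                          # integer
  name   = c("a", "b", "c", "d"),        # character
  score  = c(2.5, 3.0, 4.5, 1.0),        # numeric (double)
  passed = c(TRUE, TRUE, TRUE, FALSE),   # logical
  stringsAsFactors = FALSE
)

str(df)                              # whole table: one line per column, with its type
df[2, ]                              # one row
df$score                             # one column, as a vector
sub <- df[2:3, c("name", "score")]   # rectangular subset: rows 2-3, two columns
sub
```

The same bracket notation covers all three cases: a row index, a column index, or both together.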

The tidyverse provides a useful set of tools to organize work around tables. See the tidyverse style guide for detail. See also the data pages from various sources:

Introduction to the Tidyverse (nee Hadleyverse) (Doug Bates)
Aggregating and analyzing data with dplyr (Data Carpentry)
Data analysis 1 (Jenny Bryan) (see links under this topic)

A worked example would fit well here: first using for/while/repeat loops, then the base apply suite, then dplyr and tidyr, then purrr. The point would be that while the older approach gets the job done, the newer tidyverse provides more compact ways of organizing the task, and these become more intuitive once one adjusts to the new way of thinking about the problem.

Here is the basic idea. It might seem best to work through a table one row or column at a time, but if we can think about the whole object, the entire table, and what we want to do with it, then a more elegant solution might emerge.

base R: `for` vs. `apply`

To answer the frequently asked question, ‘How can I avoid this loop or make it faster?’: Try to use simple vectorized operations; use the family of apply functions if appropriate; initialize objects to full length when using loops; and do not repeat calculations many times if performing them just once is sufficient. Measure execution time before making changes to code, and only make changes if the efficiency gain really matters. It is better to have readable code that is free of bugs than to waste hours optimizing code to gain a fraction of a second. Sometimes, in fact, a loop will provide a clear and efficient solution to a problem (considering both time and memory use). (from https://www.r-project.org/doc/Rnews/Rnews_2008-1.pdf, p. 49)
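The advice quoted above can be sketched with a toy task, squaring the numbers 1 through 10. The task and the variable names are assumptions chosen for illustration; on real work, measure before deciding which form to use.

```r
x <- 1:10

# Loop with the result initialized to full length (preallocated).
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# Apply family: vapply() checks that each result is a single numeric value.
squares_apply <- vapply(x, function(xi) xi^2, numeric(1))

# Vectorized: arithmetic operators already work elementwise.
squares_vec <- x^2

# All three approaches give the same values.
all.equal(squares_loop, squares_apply)
all.equal(squares_loop, squares_vec)
```

For a task this simple the vectorized form is the shortest and clearest, but the loop is not wrong, only longer.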

plyr

There is a package called plyr, which seems to be retained for historical reasons and compatibility with other packages. Best to skip this package and go right to dplyr. These two packages have different syntax and approach problems somewhat differently.

split-apply-combine strategy paper by Hadley Wickham

dplyr

The dplyr package allows us to filter by rows and select columns, and to do tasks in groups organized by levels of columns. Further, these steps can be strung together using pipes as a coherent operation on a table, to create a new table. This new table may be stored, or might be used in another operation without ever being saved as a new object. See notes linked above, and the following.

Intro to R Adapted to dplyr
Data Wrangling in R by Elizabeth McDaniel (Rmd source)
dplyr on tidyverse
Data Transformation with dplyr (scroll down)
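A minimal sketch of such a pipeline, assuming the dplyr package is installed. It uses the mtcars data set that ships with R; the particular filter condition and summary are arbitrary choices for illustration.

```r
library(dplyr)

# mtcars ships with R (datasets package).
result <- mtcars %>%
  filter(mpg > 20) %>%             # keep rows that satisfy a condition
  select(mpg, cyl, wt) %>%         # keep columns by name
  group_by(cyl) %>%                # organize by the levels of a column
  summarise(mean_wt = mean(wt), n = n())

result
```

Each step takes a table and returns a table, so the intermediate results never need to be stored as separate objects.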

The latest release of dplyr at the time of writing (0.7.0) appears unchanged in its basic form. However, there is a new philosophy of “tidy evaluation” that will have a profound effect on how we use dplyr tools within functional programming. That is, the straightforward way to use dplyr is to reference table columns by name, which works well in an interactive setting. If the column names might change from table to table, and you want to write functions that handle that, it will be important to learn about tidy evaluation.

dplyr 0.7.0 release announcement
tidy evaluation
RStudio Webinars (see What’s new in dplyr 0.7.0?)
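As a hedged sketch of what tidy evaluation looks like inside a function, the helper below takes column names as arguments. The function group_mean() and its argument names are hypothetical, made up for this illustration; enquo() and !! are the quoting tools that dplyr exports, and later releases also offer other spellings of the same idea.

```r
library(dplyr)

# Hypothetical helper: mean of one column within groups of another,
# with both columns supplied by the caller as bare names.
group_mean <- function(data, group_col, value_col) {
  group_col <- enquo(group_col)    # capture the expression, unevaluated
  value_col <- enquo(value_col)
  data %>%
    group_by(!!group_col) %>%      # !! unquotes it inside the dplyr verb
    summarise(mean_value = mean(!!value_col))
}

by_cyl  <- group_mean(mtcars, cyl, mpg)
by_gear <- group_mean(mtcars, gear, wt)
by_cyl
```

The same function works for any table and any pair of columns, which is what plain column references inside a function cannot do.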

tidyr

The package tidyr is a companion to dplyr, with tools to rearrange tables. It is not covered in Data Carpentry.
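A small sketch of rearranging a table, assuming tidyr 1.0.0 or later, where pivot_longer() and pivot_wider() are available (older versions used gather() and spread() for the same job). The table itself is made up for illustration.

```r
library(tidyr)

# Illustrative wide table: one row per subject, one column per year.
wide <- data.frame(
  subject = c("s1", "s2"),
  y2015   = c(10, 20),
  y2016   = c(11, 21)
)

# Wide to long: one row per subject-year combination.
long <- pivot_longer(wide, cols = c(y2015, y2016),
                     names_to = "year", values_to = "value")
long

# Long back to wide.
back <- pivot_wider(long, names_from = year, values_from = value)
back
```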

Note: There used to be one Data Wrangling CheatSheet for dplyr and tidyr. It has been replaced by the Data Transformation and Data Input CheatSheets.

Some use the older reshape2, which includes some column aggregation features that are not easily done in tidyr. See for instance:

purrr

Prettying up tables

matrix vs data frame vs tibble
knitr figures and tables (Karl Broman) (kable, pander, xtable)
printr package

Everything in R is a vector

If you are new to programming, this subsection might be skipped. Experienced programmers who are new to R might find this useful. The basic argument: everything in R is a vector (no scalars!). See also

Data analysis 2: vectors and files (scroll down to this topic)

Atomic vector types include integer, double (double-precision numeric), character (strings), logical, and a few others such as complex and raw.

(v <- 1:6)

[1] 1 2 3 4 5 6
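The atomic types can be inspected with typeof(). This short sketch also illustrates the "no scalars" point: a single value is simply a vector of length 1.

```r
typeof(1:3)            # "integer"
typeof(c(1.5, 2))      # "double"
typeof("a")            # "character"
typeof(TRUE)           # "logical"

# Even a single value is a vector of length 1.
length(42)             # 1
is.vector(42)          # TRUE

# Mixing types in c() coerces everything to the most general type.
typeof(c(1L, 2.5))     # "double"
typeof(c(1, "a"))      # "character"
```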

A list is a vector whose elements can be objects of any type and need not be related to one another (but can be). Here we do something silly, turning a vector of six numbers into a list, which is a vector of six objects, each of which is a vector with one number.

(as.list(v <- 1:6))

[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

[[4]]
[1] 4

[[5]]
[1] 5

[[6]]
[1] 6
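A less silly list, sketched here with made-up contents, mixes elements of different types and lengths, including another list:

```r
# An illustrative list whose elements differ in type and length.
l <- list(1:3, "text", c(TRUE, FALSE), list(pi))

length(l)               # 4 elements
sapply(l, typeof)       # "integer" "character" "logical" "list"
sapply(l, length)       # 3 1 2 1
```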

Attributes

Attributes can be attached to objects:

v <- 1:6
attributes(v) <- list(message = "hello world")
v

[1] 1 2 3 4 5 6
attr(,"message")
[1] "hello world"

Attributes can turn a vector into a matrix:

v <- 1:6
attributes(v) <- list(dim= c(3,2))
v

     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

This is equivalent to

(x <- matrix(1:6, ncol=2))

     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

Note that while v (and x) is a matrix, it is still a vector of six numbers:

attributes(v)

$dim
[1] 3 2

length(v)

[1] 6

v[4]

[1] 4

A matrix can be turned into a data frame:

(x <- as.data.frame(x))

Now x as a data frame is a vector (actually a list) of length 2, with each element of the list being a vector of numbers:

x[[2]]

[1] 4 5 6
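The list nature of a data frame can be checked directly. This sketch rebuilds x so that it runs on its own; V1 and V2 are the default column names that as.data.frame() gives to an unnamed matrix.

```r
x <- as.data.frame(matrix(1:6, ncol = 2))

is.list(x)      # TRUE
length(x)       # 2, the number of columns
names(x)        # "V1" "V2"
x$V2            # the same column as x[[2]]
attributes(x)   # includes names, class and row.names
```

So a data frame is a list of equal-length vectors plus a few attributes, in the same way that a matrix is a vector plus a dim attribute.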

Subsetting and Accessing

Subset using positive integers:

v[c(3,4,6,6)]

[1] 3 4 6 6

v[1:3]

[1] 1 2 3

Subset using logical values:

v[c(FALSE, FALSE, TRUE)]

[1] 3 6

This takes some explaining. The vector c(FALSE, FALSE, TRUE) is shorter than v, so it is recycled along the indices of v (here 1:6), which selects every third element. Try this one:

v[c(TRUE, TRUE, FALSE, FALSE)]

[1] 1 2 5 6
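To make the recycling visible, this sketch stretches the logical index to the length of v by hand with rep_len() before subsetting:

```r
v <- 1:6

# Recycle the logical index to the length of v explicitly.
idx <- rep_len(c(TRUE, TRUE, FALSE, FALSE), length(v))
idx        # TRUE TRUE FALSE FALSE TRUE TRUE
v[idx]     # 1 2 5 6, the same as v[c(TRUE, TRUE, FALSE, FALSE)]
```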

A logical index is often the result of a comparison. Here it comes from a sample of six standard normal values:

(r <- rnorm(6, 0, 1))

[1]  0.02301301 -0.52960868  1.49565139 -0.64454178 -0.01451554  0.76069083

v[r>0]

[1] 1 3 6

(r <- rnorm(15, 0, 1))

 [1] -1.151110064 -0.118116418 -0.739648098 -0.906311215  0.986999972
 [6]  0.617108642  0.009230102 -0.493441499  1.118718042  0.277161713
[11]  2.087755742  0.101934087  0.936605199  1.196402675 -0.108575219

v[r>0]

[1]  5  6 NA NA NA NA NA NA NA

Notice that in this case the logical index has length 15 while v has length 6, so every TRUE at a position beyond 6 points past the data and returns a missing value (NA).
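One way to avoid these missing values, sketched here as an assumption about what is wanted, is to cut the logical index down to the length of v before using it. The seed is arbitrary and only makes the sketch repeatable.

```r
v <- 1:6
set.seed(1)              # arbitrary seed, only for repeatability
r <- rnorm(15, 0, 1)

keep <- r > 0
length(keep)             # 15, longer than v

# Restrict the index to the positions that exist in v.
out <- v[keep[seq_along(v)]]
out
```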

Use negative numbers to exclude some indices. You cannot mix positive and negative numbers.

v[-(1:3)]

[1] 4 5 6
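The rule against mixing signs can be seen in a short sketch. tryCatch() is used here only so that the error is reported as text instead of stopping the script.

```r
v <- 1:6

v[-c(1, 6)]    # drop the first and last elements: 2 3 4 5

# Mixing positive and negative indices is an error.
msg <- tryCatch(v[c(-1, 2)], error = function(e) conditionMessage(e))
msg
```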

An index of zero (0) is skipped, whether it is mixed with positive or negative numbers. This is useful, for instance, with the match() function if you set nomatch = 0.

v[c(0,2,4)]

[1] 2 4

v[c(0,-2,-4)]

[1] 1 3 5 6

v[0]

integer(0)

Note that if all indices to a vector are 0, then a vector of length 0 is returned. This also happens with NULL:

v[NULL]

integer(0)

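Here is match() used with nomatch = 0. In this sample, "a", "c" and "e" are all found in r, but at positions beyond the length of v, so v[m] returns missing values:

(r <- sample(letters, 15))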

 [1] "p" "u" "o" "m" "q" "t" "x" "n" "h" "j" "l" "a" "c" "e" "w"

(m <- match(c("a","c","e"), r, nomatch = 0))

[1] 12 13 14

v[m]

[1] NA NA NA
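The effect of nomatch = 0 shows up when a lookup fails. In this sketch the lookup vector and the search values are made up for illustration, and "z" is deliberately absent.

```r
v <- 1:6
lookup <- c("a", "b", "c", "d", "e", "f")   # illustrative labels for v

# "z" is not in lookup.
m_default <- match(c("b", "z", "d"), lookup)               # 2 NA 4
m_zero    <- match(c("b", "z", "d"), lookup, nomatch = 0)  # 2 0 4

v[m_default]   # 2 NA 4: the failed match shows up as NA
v[m_zero]      # 2 4:    the failed match is silently dropped
```

Which behaviour is preferable depends on the task: dropping a failed match is convenient, but it also hides the fact that something was not found.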

Elements of a vector can be identified by name:

v <- c(v)  # c() drops the dim attribute, leaving a plain vector
names(v) <- letters[seq_along(v)]
v

a b c d e f 
1 2 3 4 5 6

v["c"]

c 
3

For lists, you can use [[]] or $. Double brackets (special for lists) pull out a single component; $ is a shorthand that works for syntactically valid names.

l <- as.list(v)
(l[["c"]])

[1] 3

(l$c)

[1] 3
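The difference between single and double brackets on a list is easy to miss, so here is a short sketch. The list contents are illustrative.

```r
l <- list(a = 1, b = "two", c = 3)

l["c"]        # single brackets: a list of length 1
l[["c"]]      # double brackets: the element itself, here 3
l$c           # shorthand for l[["c"]]

# $ needs a syntactically valid name; [[ ]] accepts any string or a variable.
l[["my name"]] <- 4
key <- "b"
l[[key]]      # "two"
```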

We saw recycling of logical indices above. Here is another form of recycling:

v + c(1,2)

a b c d e f 
2 4 4 6 6 8

The shorter vector c(1,2) is recycled to the length of v, giving 1, 2, 1, 2, 1, 2, so the sum is 2, 4, 4, 6, 6, 8.
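Recycling in arithmetic is silent when the shorter length divides the longer one, and gives a warning otherwise. This sketch catches the warning with withCallingHandlers() only so that it is printed as a message and the result is still returned.

```r
v <- 1:6

v + 10            # length-1 vector recycled: 11 12 13 14 15 16
v + c(1, 2, 3)    # length 3 divides 6:       2 4 6 5 7 9

# Length 4 does not divide 6: R still recycles, but warns.
res <- withCallingHandlers(
  v + c(1, 2, 3, 4),
  warning = function(w) {
    message("warning: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
res               # 2 4 6 8 6 8
```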

Logical

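While T and F work for TRUE and FALSE, don’t use them. Always spell out the full word. T and F are ordinary variables that can be reassigned, whereas TRUE and FALSE are reserved words, so spelling them out avoids bizarre bugs. For the same reason, avoid using T and F as names of objects.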
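A brief sketch of how this can go wrong. Assigning to T is done here purely for demonstration and is undone straight away.

```r
T <- 0                     # legal: T is just a variable
isTRUE(T)                  # FALSE now
if (T) "yes" else "no"     # "no"

rm(T)                      # remove our copy; T is the built-in value again
isTRUE(T)                  # TRUE

# TRUE and FALSE are reserved words, so `TRUE <- 0` is an error.
```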

Logical expressions can use the shorter or longer form of AND (& vs &&) or OR (| vs ||). See the help("&") page or R bloggers for full details. The shorter form performs elementwise comparisons in much the same way as arithmetic operators. The longer form evaluates left to right on single values, and evaluation proceeds only until the result is determined. Older versions of R examined only the first element when given a longer vector; since R 4.3.0, a vector of length greater than one is an error. The longer form is appropriate for programming control flow and is typically preferred in if clauses.
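The two forms can be compared in a short sketch. The vectors a and b and the NULL value of x are illustrative.

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

a & b     # elementwise: TRUE FALSE FALSE
a | b     # elementwise: TRUE TRUE TRUE

# && and || are meant for single values, as in an if clause.
x <- NULL
# Short-circuit: length(x) > 0 is FALSE, so x[1] > 0 is never evaluated.
status <- if (length(x) > 0 && x[1] > 0) "positive" else "empty or not positive"
status

# any() or all() reduce a logical vector to one value before it is used in if.
if (any(a & b)) "at least one pair is both TRUE"
```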

Working with Data in R

Brian S. Yandell

6/29/2017
