There are many good introductions to data structures in R. Basically, R can act as a calculator, with built-in functions such as log(10) and sqrt(4), but it can do much more. See examples in the links below.
The basic structure in R is a vector. Here is a vector of the numbers 1 through 5:
1:5
[1] 1 2 3 4 5
Challenge
Go through the Data Carpentry or Jenny Bryan introduction to learn more about data basics. Read through Doug Bates’s notes to fill in further gaps.
Jenny Bryan expounds on data science. GapMinder was an early innovation in data exploration. The GapMinder data is used throughout Jenny Bryan’s course, and can now be found as the gapminder R package. Hans Rosling, originator of GapMinder, has passed away, but his tools and ideas persist.
It is often more useful to think of R as mostly operating on rectangular tables, which have many of the characteristics of spreadsheets. Tables, known in R as data frames, have observations (individuals or cases) in rows and variables (fields) in columns. The variables might be of different types, say numeric or character or logical. We might want to operate on the whole table, or on a row or column, or on some rectangular subset of the table.
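As a minimal sketch of these ideas, the following builds a small data frame and then works with the whole table, one column, one row, and a rectangular subset. The object name pets, its column names, and its values are made up purely for illustration.

```r
# Hypothetical toy data: three observations (rows), three variables (columns)
pets <- data.frame(
  name   = c("Ada", "Bo", "Cy"),   # character variable
  weight = c(4.2, 11.5, 7.3),      # numeric variable
  indoor = c(TRUE, FALSE, TRUE),   # logical variable
  stringsAsFactors = FALSE         # keep strings as character in older R versions
)

dim(pets)                  # whole table: number of rows and columns
sapply(pets, class)        # type of each variable
pets$weight                # one column, returned as a vector
pets[2, ]                  # one row, still a data frame
indoor_pets <- pets[pets$indoor, c("name", "weight")]  # rectangular subset
indoor_pets
```

The pattern table[rows, columns] is the base R way to take a rectangular subset; leaving either side of the comma empty keeps all rows or all columns.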
The tidyverse provides a useful set of tools to organize work around tables. See the tidyverse style guide for detail. See also the data pages from various sources:
Right here I would like to go through an example using for/while/repeat loops, then the base apply suite, then dplyr and tidyr, then purrr. The point would be that while the old way gets the job done, the newer tidyverse provides more compact ways of organizing the task that are more intuitive, once one overcomes the new way of thinking about the problem.
Here is the basic idea. It might seem best to go a row and column at a time and do operations, but if we can think about the whole object, the entire table, and what we want to do with it, then a more elegant solution might emerge.
for vs. apply
To answer the frequently asked question, ‘How can I avoid this loop or make it faster?’: Try to use simple vectorized operations; use the family of apply functions if appropriate; initialize objects to full length when using loops; and do not repeat calculations many times if performing them just once is sufficient. Measure execution time before making changes to code, and only make changes if the efficiency gain really matters. It is better to have readable code that is free of bugs than to waste hours optimizing code to gain a fraction of a second. Sometimes, in fact, a loop will provide a clear and efficient solution to a problem (considering both time and memory use). (from https://www.r-project.org/doc/Rnews/Rnews_2008-1.pdf, p. 49)
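The advice quoted above can be sketched with a deliberately trivial task, squaring each element of a vector, done three ways. The input values are arbitrary, and this is only one possible illustration rather than a benchmark.

```r
x <- c(1.5, 2.5, 4, 10)   # hypothetical input values

# 1. for loop, with the result initialized to full length before looping
res_loop <- numeric(length(x))
for (i in seq_along(x)) {
  res_loop[i] <- x[i]^2
}

# 2. apply family: vapply() also checks that each result is a single number
res_apply <- vapply(x, function(xi) xi^2, numeric(1))

# 3. vectorized arithmetic operates on the whole vector at once
res_vec <- x^2

# All three approaches give the same answer
all.equal(res_loop, res_vec)
all.equal(res_apply, res_vec)
```

Wrapping any of these in system.time() is a simple way to measure execution time before deciding whether a rewrite is worth the effort.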
There is a package called plyr, which seems to be retained for historical reasons and compatibility with other packages. Best to skip this package and go right to dplyr. These two packages have different syntax and approach problems somewhat differently.
The dplyr package allows us to filter by rows and select columns, and to do tasks in groups organized by levels of columns. Further, these steps can be strung together using pipes as a coherent operation on a table, to create a new table. This new table may be stored, or might be used in another operation without ever being saved as a new object. See notes linked above, and the following.
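A minimal sketch of such a pipeline, assuming the dplyr package is installed and using the mtcars data set that ships with R. The threshold of 100 horsepower and the name mpg_by_cyl are arbitrary choices for illustration.

```r
library(dplyr)

# mtcars columns used here: hp = horsepower, cyl = cylinders, mpg = miles per gallon
mpg_by_cyl <- mtcars %>%
  filter(hp > 100) %>%        # keep rows (cars) with more than 100 horsepower
  select(cyl, mpg) %>%        # keep only two columns
  group_by(cyl) %>%           # one group per level of cyl
  summarise(mean_mpg = mean(mpg), n = n())  # one row per group

mpg_by_cyl
```

Each step takes a table and returns a table, so intermediate results never need to be saved unless you want them.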
The latest release of dplyr (0.7.0) in its basic form appears unchanged. However, there is a new philosophy of “tidy evaluation” that will have a profound effect on how we use dplyr tools within functional programming. That is, the straightforward way to use dplyr is to reference table columns by name, which works great in an interactive setting. If the column names might change from table to table, and you want to create tools that leverage that, it will be important to learn about tidy evaluation.
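To make the issue concrete, here is a sketch of a function whose column names are supplied by the caller. It assumes a recent dplyr (1.0 or later), where the embrace operator {{ }} is available; the 0.7.0 release expressed the same idea with enquo() and !!. The function name group_mean and its argument names are hypothetical.

```r
library(dplyr)

# Hypothetical helper: mean of one column within groups of another column.
# {{ }} passes the caller's bare column names through to dplyr verbs.
group_mean <- function(data, group_col, value_col) {
  data %>%
    group_by({{ group_col }}) %>%
    summarise(mean_value = mean({{ value_col }}), .groups = "drop")
}

# Same function, different columns, no change to the function body
by_cyl  <- group_mean(mtcars, cyl, mpg)
by_gear <- group_mean(mtcars, gear, hp)
by_cyl
```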
The package tidyr is a companion to dplyr, with tools to rearrange tables. It is not covered in Data Carpentry.
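As a small sketch of what rearranging a table means, the following reshapes a wide table to long form and back. It assumes tidyr 1.0.0 or later, which provides pivot_longer() and pivot_wider(); the table and its column names are invented for illustration.

```r
library(tidyr)

# Hypothetical wide table: one row per id, one column per measured variable
wide <- data.frame(id = 1:2, x = c(5, 6), y = c(7, 8))

# Wide to long: one row per id and variable combination
long <- pivot_longer(wide, cols = c(x, y),
                     names_to = "variable", values_to = "value")

# Long back to wide
back <- pivot_wider(long, names_from = variable, values_from = value)

long
back
```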
Note: There used to be one Data Wrangling CheatSheet for dplyr and tidyr. It has been replaced by the Data Transformation and Data Input CheatSheets.
Some use the older reshape2, which includes some column aggregation features that are not easily done in tidyr. See for instance:
If you are new to programming, this subsection might be skipped. Experienced programmers who are new to R might find this useful. The basic argument: everything in R is a vector (no scalars!). See also
Atomic vector types include integer, double (double-precision numeric), character (strings), logical, and a few others.
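A quick way to see these types is typeof(). The sketch below also shows what happens when types are mixed inside c(), and that a single value is just a vector of length one.

```r
typeof(1:3)          # "integer"
typeof(c(1.5, 2))    # "double"
typeof("a")          # "character"
typeof(TRUE)         # "logical"

# Mixing types in c() coerces everything to the most flexible type present
typeof(c(1L, 2.5))   # "double"
typeof(c(1, "a"))    # "character"

# There are no scalars: a single number is a vector of length one
length(42)           # 1
is.vector(42)        # TRUE
```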
(v <- 1:6)
[1] 1 2 3 4 5 6
A list is a vector whose elements can be objects of any type; the elements don’t have to be related to one another (but can be). Here we do something silly, turning a vector of six numbers into a list, which is a vector of six objects, each of which is a vector with one number.
(as.list(v <- 1:6))
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
[[4]]
[1] 4
[[5]]
[1] 5
[[6]]
[1] 6
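A less silly sketch: a list whose elements have nothing in common. The object names stuff and nested are made up for illustration.

```r
# Hypothetical list mixing a number, a character vector, and a function
stuff <- list(1.5, c("a", "b"), mean)
length(stuff)            # 3
sapply(stuff, class)     # "numeric" "character" "function"

# Lists can also hold other lists
nested <- list(first = 1:3, second = list(x = "inner"))
nested$second$x          # "inner"
```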
Attributes can be attached to objects:
v <- 1:6
attributes(v) <- list(message = "hello world")
v
[1] 1 2 3 4 5 6
attr(,"message")
[1] "hello world"
Attributes can turn a vector into a matrix:
v <- 1:6
attributes(v) <- list(dim= c(3,2))
v
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
This is equivalent to
(x <- matrix(1:6, ncol=2))
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
Note that while v (and x) is a matrix, it is still a vector of six numbers:
attributes(v)
$dim
[1] 3 2
length(v)
[1] 6
v[4]
[1] 4
A matrix can be turned into a data frame:
(x <- as.data.frame(x))
  V1 V2
1  1  4
2  2  5
3  3  6
Now x as a data frame is a vector (actually a list) of length 2, with each element of the list being a vector of numbers:
x[[2]]
[1] 4 5 6
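The list nature of a data frame can be checked directly. This sketch rebuilds the same x so that it runs on its own.

```r
x <- as.data.frame(matrix(1:6, ncol = 2))

is.list(x)       # TRUE: a data frame is a list of columns
length(x)        # 2, the number of columns
names(x)         # "V1" "V2"

x[[2]]           # double brackets: the second column as a vector
x$V2             # the same column, by name
x[2]             # single brackets: still a data frame, with one column

col_sums <- lapply(x, sum)   # list tools work column by column
col_sums
```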
Subset using positive integers:
v[c(3,4,6,6)]
[1] 3 4 6 6
v[1:3]
[1] 1 2 3
Subset using logical values:
v[c(FALSE, FALSE, TRUE)]
[1] 3 6
This takes some explaining: the vector c(FALSE, FALSE, TRUE) is cycled through the indices of v (here 1:6). Try this one:
v[c(TRUE, TRUE, FALSE, FALSE)]
[1] 1 2 5 6
Logical subsetting using draws from a normal distribution:
(r <- rnorm(6, 0, 1))
[1] 0.02301301 -0.52960868 1.49565139 -0.64454178 -0.01451554 0.76069083
v[r>0]
[1] 1 3 6
(r <- rnorm(15, 0, 1))
[1] -1.151110064 -0.118116418 -0.739648098 -0.906311215 0.986999972
[6] 0.617108642 0.009230102 -0.493441499 1.118718042 0.277161713
[11] 2.087755742 0.101934087 0.936605199 1.196402675 -0.108575219
v[r>0]
[1] 5 6 NA NA NA NA NA NA NA
Notice that in this case the logical index (length 15) is longer than v (length 6), so TRUE values at positions beyond the data return missing values (NA).
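The same behavior can be reproduced and checked as below. The seed is arbitrary and only makes the sketch repeatable; the name keep is hypothetical.

```r
v <- 1:6
set.seed(1)          # arbitrary seed so the sketch is reproducible
r <- rnorm(15)
keep <- r > 0        # logical vector of length 15

length(keep) == length(v)   # FALSE: the index is longer than the data

out <- v[keep]
# One NA for every TRUE that falls beyond position 6
sum(is.na(out)) == sum(keep[7:15])

# Trimming the index to the length of v avoids the NAs
v[keep[seq_along(v)]]
```

Checking that a logical index has the same length as the vector it subsets is a cheap way to catch this kind of mismatch early.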
Use negative numbers to exclude some indices. You cannot mix positive and negative numbers.
v[-(1:3)]
[1] 4 5 6
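A short sketch of both points: excluding with negative indices, and the error raised when signs are mixed. tryCatch() is used only so the sketch keeps running after the error; the exact wording of the message may differ between R versions.

```r
v <- 1:6

dropped <- v[-c(1, 3)]    # drop elements 1 and 3
dropped                   # 2 4 5 6

# Mixing a negative and a positive index is an error; capture its message
msg <- tryCatch(v[c(-1, 2)], error = function(e) conditionMessage(e))
msg
```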
Zero (0) is skipped. This works with positive or negative numbers. It is useful, for instance, with the match() function if you set nomatch = 0.
v[c(0,2,4)]
[1] 2 4
v[c(0,-2,-4)]
[1] 1 3 5 6
v[0]
integer(0)
Note that if all indices to a vector are 0, then a vector of length 0 is returned. This also happens with NULL:
v[NULL]
integer(0)
(r <- sample(letters, 15))
[1] "p" "u" "o" "m" "q" "t" "x" "n" "h" "j" "l" "a" "c" "e" "w"
(m <- match(c("a","c","e"), r, nomatch = 0))
[1] 12 13 14
v[m]
[1] NA NA NA
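In the run above every letter happened to match, at positions beyond the length of v, so the effect of nomatch = 0 is not visible. A smaller sketch with one deliberate non-match shows the difference; the vector pool and the lookup values are made up.

```r
v <- 1:6
pool <- c("b", "z", "a")                  # hypothetical table to look up in

m_na   <- match(c("a", "q", "b"), pool)               # 3 NA 1 ("q" is not found)
m_zero <- match(c("a", "q", "b"), pool, nomatch = 0)  # 3  0 1

v[m_na]     # 3 NA 1: the NA index produces an NA result
v[m_zero]   # 3 1: the zero index is skipped
```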
Elements of a vector can be identified by name:
v <- c(v)  # c() drops the dim attribute, so v is a plain vector again
names(v) <- letters[seq_along(v)]
v
a b c d e f
1 2 3 4 5 6
v["c"]
c
3
For lists, you can use [[]] or $. Double brackets pull out a single component, whereas single brackets on a list return a shorter list; $ is a shorthand that works for syntactically valid names.
l <- as.list(v)
(l[["c"]])
[1] 3
(l$c)
[1] 3
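The differences among [, [[ and $ on a list can be sketched as follows. The list l2 and its names are invented, including one name that is deliberately not syntactically valid.

```r
l2 <- list(a = 1, b = "two", `odd name` = 3)

l2["a"]            # single brackets: a list of length one
l2[["a"]]          # double brackets: the component itself
l2$a               # shorthand for l2[["a"]]

l2[["odd name"]]   # any name works as a string inside [[ ]]
l2$`odd name`      # with $, an unusual name needs backticks

key <- "b"
l2[[key]]          # [[ ]] uses the value stored in key: "two"
l2$key             # $ looks for a component literally called "key": NULL
```

Because $ does not evaluate its argument, [[ ]] is the form to use when the name is held in a variable.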
We saw recycling of logical indices above. Here is another form of recycling:
v + c(1,2)
a b c d e f
2 4 4 6 6 8
Here the shorter vector c(1, 2) is recycled along v, so the sums are 2, 4, 4, 6, 6, 8.
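A few more recycling cases, as a sketch using an unnamed v. The last case captures the warning R gives when the longer length is not a multiple of the shorter one.

```r
v <- 1:6

v * 10              # a length-one vector is recycled: 10 20 30 40 50 60
v + c(1, 2)         # recycled three times: 2 4 4 6 6 8
v + c(1, 2, 3)      # recycled twice: 2 4 6 5 7 9

# Length 4 does not divide 6: R still recycles, but warns.
# tryCatch() stores the warning message instead of printing it.
warn <- tryCatch(v + c(1, 2, 3, 4), warning = function(w) conditionMessage(w))
warn
```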
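While T and F work for TRUE and FALSE, don’t use them. TRUE and FALSE are reserved words, whereas T and F are ordinary variables that can be reassigned, which can lead to bizarre behavior. Always spell out the full word, and avoid using T and F as names of objects.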
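A sketch of the kind of bizarre circumstance to avoid. The name shadowed is hypothetical; the sketch cleans up after itself with rm().

```r
T <- 0                                # legal, because T is an ordinary variable
shadowed <- if (T) "yes" else "no"    # T is now 0, which counts as FALSE
shadowed                              # "no"

rm(T)        # remove our copy; the built-in T is visible again
isTRUE(T)    # TRUE

# By contrast, `TRUE <- 0` is an error, because TRUE is a reserved word.
```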
Logical expressions can use the shorter or longer form of AND (& vs &&) or OR (| vs ||). See the help("&") page or R bloggers for full details. The shorter form performs elementwise comparisons in much the same way as arithmetic operators. The longer form evaluates left to right, and evaluation proceeds only until the result is determined. It is meant for single (length-one) values: older versions of R examined only the first element of each vector, while R 4.3.0 and later signal an error when given a longer vector. The longer form is appropriate for programming control flow and is typically preferred in if clauses.
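The two forms can be compared in a short sketch. The vectors a and b and the variable maybe are made up for illustration.

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

a & b     # elementwise AND: TRUE FALSE FALSE
a | b     # elementwise OR:  TRUE TRUE TRUE

# Short-circuit: the right side is never evaluated when the left side is FALSE,
# so the comparison with NULL is safely skipped here.
maybe <- NULL
ok <- !is.null(maybe) && maybe > 0
ok        # FALSE

# To use a vector comparison in an if clause, reduce it to one value first
any_both <- if (any(a & b)) "at least one position is TRUE in both" else "none"
any_both
```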