These are Yandell’s notes from Fall 2014 sessions to augment lectures. They are open to read, somewhat for my benefit and yours. Comments welcome. The first link is to Bates’s GitHub document; the second is to the notes below, by date.
github notes
cd
git clone https://github.com/dmbates/stat692
cd stat692
git pull
See also github links at http://www.stat.wisc.edu/network-skills#beyondR
Add notes on how to clone a git repository within RStudio (if possible)
===============
Using Install button to get package polynom
Kernighan and Plauger
Elements of Programming Style
http://en.m.wikipedia.org/wiki/The_Elements_of_Programming_Style
stopifnot()
see Doug's Data.Rmd at his github
help for package
search path
datasets for intro classes--what is out there
count.fields
$ as "syntactic sugar" vs [[]]
2014-09-19
moodle.wisc.edu course page
first homework assignment at moodle
Data.Rmd
with() better than attach() ... detach()
within() to change assignments in structure
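a minimal sketch (my addition), using the classroom data's mathkind/mathgain scores:
summary(with(classroom, mathgain - mathkind))  ## evaluate in the data without attach()/detach()
classroom <- within(classroom, gain <- mathgain - mathkind)  ## change/add variables inside the structure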
count.fields()
manual: input/output ... http://cran.r-project.org/doc/manuals/r-release/R-data.html
read.table: tell what fields are
data organization
wide and long formats
long format--need example
factor and ordered factor
subset
factors: can use drop=TRUE to explicitly drop unused levels
or apply factor() to a factor()
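a minimal sketch (my addition):
f <- factor(c("a", "b", "c"))
levels(f[1:2])               ## "a" "b" "c": unused level lingers after subsetting
levels(f[1:2, drop = TRUE])  ## "a" "b": drop=TRUE discards unused levels
levels(factor(f[1:2]))       ## same effect from re-applying factor()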
stack/unstack; reshape
xtabs(~xtabs(~schoolid, classroom))
xtabs(~xtabs(~classid, classroom))
data visualization
ggplot2.org
Pima Indian data from the faraway package
qplot
empirical density plot
use factors for things that should be factors
don't do dumb things with boxplots
2014-09-26
homework comments
be sure to document what you did so you will later recall details
childid problem
classroom <- within(classroom, {sex <- factor(sex, labels=c("M","F"))
minority <- factor(minority, labels=c("N","Y"))
childid <- factor(childid)
schoolid <- factor(schoolid)
classid <- factor(classid)
})
str(rownames(classroom))
## note: evaluates as character string
all(rownames(classroom) == classroom$childid)  ## without the str() wrapper, which returns NULL invisibly
loops
R prefers vectorized calculation--work on the whole object
look ahead
cd /afs/cs.wisc.edu/p/stat/data
ls
cd MS.exam
ls -l
cd s11
ls -l
linux basic tools
head file.txt
wc -l file.txt
ls -lh
wc
less or zless
zcat file.gz | wc -l
homework
ex2.Rmd
details of Rmd document
"---" YAML metadata section at top
html_document to pdf_document
preliminary section
set options
```{r initial,echo=FALSE,print=FALSE,cache=FALSE}
library(knitr)
library(ggplot2)
load("classroom.rda")
opts_chunk$set(cache=TRUE)
options(width=76, show.signif.stars=FALSE, str=strOptions(strict.width="cut"))
set.seed(1234)
```
comments on cache and chunks
see Rstudio home->products->R packages
knitr (caution--poor navigation of document)
R markdown (read this one)
full specification of pandoc markdown
note: Rstudio has lookahead for options as you type
```{r hist1,echo=FALSE,fig.align="center",fig.cap="Histogram of blah blah",fig.height=2.5,fig.width=6.5,fig.lp="fig:hist1",fig.pos="top"}
qplot(mathkind,data=classroom, geom="histogram")
```
somehow this did not work; will fix later (likely because chunk options must all be on one line in the header)
back to ggplot2 slides
bivariate plots
log-log scale spreads data out nicely
comparative box plots--horizontal
regression or ancova
stat500 data with geom_smooth(method="lm") + coord_equal()
watch aspect ratio (studied by Bill Cleveland): should be ~45 degree angle
add reference line using geom_abline
note Galton's regression toward the mean
ancova
cathedral data
geom_point(aes(color=style)) + geom_smooth(aes(col=style), method="lm")
different panels or facets using formula (left=rows, right=columns)
facet_grid(.~style)
Tufte's small multiples
Cleveland: details of how to do this well
2014-10-10
assignment due NEXT Tuesday to produce several figures
look at MS exam data
ex: Spring 2011 MS exam problem on Quadrat Summary
auto conversion
think of names of variables and number of unique values--are they labels?
note "X" prefix on variable identifier
single bracket on data.frame gets data.frame: summ[1] vs summ[[1]]
the first one is like using subset() on a data.frame
single (&) vs double (&&) ampersand
single: component-wise AND
double: programmatic AND (evaluates first as scalar; if FALSE do not evaluate second)
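a minimal sketch (my addition):
c(TRUE, FALSE) & c(TRUE, TRUE)  ## component-wise: TRUE FALSE
FALSE && stop("not reached")    ## FALSE: && short-circuits, so stop() is never evaluated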
are there any entries that have Browsed but not Present?
nrow(subset(summ, X2m_SppPresent == "N" & X2m_SPPBrowsed == "Y"))
-----------------------
Linear Models
brain weight data
ggplot, scale_x_log10()
geom_smooth(): mostly linear but a bit of a bump
linear: geom_smooth(method="lm")
model fitting functions
style suggestion for reproducibility
use the data argument with the data.frame after it is cleaned
keep Rmd document with steps of cleaning and steps in analysis
note explicit inclusion of intercept as "~ 1 +"
model.matrix()
gaussian model: y ~ N(Xb, s^2 I)
linear model form works here, but not for more general models (e.g. Poisson)
str() of model.matrix -- notice "assign" attribute
many subtle things go on behind the scenes
model matrix more appropriate name than "design matrix"
suppress intercept: "~ 0 +"
coef(summary(fm1)) includes variability but coef(fm1) does not
vcov(fm1) provides variance-covariance matrix of coefficients
note: stderr() is error message stream, NOT standard error
offset(log(BodyWt)) to test if slope is 1
plot of model fit: get 4 of 6 possible diagnostic plots
think about aspect ratio
diagnostics useful, but designed for blind selection of "best" model
often better these days to use visualization tools to explore
methods for lm objects
generic functions such as print, summary, plot
what methods does a particular class have?
non-visible functions (not in NAMESPACE)
extractor functions such as coef, deviance, logLik
use methods that exist to "future proof" your code
role of methods can change in the future
provide extractors in the code
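a minimal sketch (my addition, assuming fm1 is the fitted model above):
methods(class = "lm")       ## all generics with lm methods
getS3method("print", "lm")  ## view a non-visible method not exported from its namespace
coef(fm1)                   ## extractor: future-proof, unlike fm1$coefficients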
2014-10-17
Rmd and figures
short wide figures more flexible
---
title:
author:
date:
preamble
output:
pdf_document:
fig_caption: yes
fig_height: 3
---
preliminary R section
set options, load libraries, create initial objects (such as plots)
## subsection (can be numbered if desired--see manual)
now look at report and plots for kindergarten math scores
notice outlying points--what is going on?
test has minimum and maximum scores
minimum is allegedly adjusted for possibility of guessing: artificial value
notice more variability in boys than girls
style: figure captions as complete sentences, explaining plot. Enough information so busy reader can read caption and comprehend what is going on.
side-by-side plots: perceive shape but not shift
histogram vs quantile-quantile plot vs boxplots
note minority vs non-minority
scatter plot of score vs gain (=difference)
notice collinearity: pre and post scores each have variability
low initial scores have high gains
loess curve: see http://en.m.wikipedia.org/wiki/Local_regression
be careful of interpreting intercepts: are they meaningful?
scatterplot of pre vs post scores (more appropriate)
use theme(aspect.ratio = 1) or coord_equal() to set aspect ratio to one
facet_grid(sex ~ .) puts one on top of the other
don't know how to put labels on left rather than top
facets use the same axes: can look at patterns between
experiment: what communicates best
is the variability among schools, classrooms, or individuals?
be careful about confounding these nested sources of variability
coord_flip()
what kind of spread within classroom of individuals?
careful of outliers and leverage if plotting gains
gains are not effective measure of instructor!
think about ordering of classrooms
by ID: meaningless
by mean
school11 <- within(droplevels(subset(classroom, schoolid==11)),
classid <- reorder(classid, mathgain))
note: reorder has third argument for function to use for reordering
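for example (my sketch), reorder by median instead of the default mean:
school11 <- within(droplevels(subset(classroom, schoolid==11)),
classid <- reorder(classid, mathgain, median))  ## third argument is the summary function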
digression on multilevel/hierarchical modeling
caution on longitudinal data and migration
==========================
Dealing with Big Data
what is Big Data? (http://www.stat.wisc.edu/bigdata)
volume: very large
velocity: arriving and changing rapidly
variety: many sources
veracity
validity
volatility
Big Data in R
R is slow interpreter and is single threaded
in-memory storage, conservative copying
R is not suited to real-time processing of data streams
Moore's law
number of transistors on an IC would double in approximately 18 months
roughly interpreted as speed of processor
density of circuits is now incredible (sub-micron geometries)
density of memory has exploded (solid state drive)
but cannot speed up clock rate (related to dissipation of energy)
modern computing architectures evolving quickly
algorithms designed around how to move data
to use effectively, want to spread out the calculation
goal: distribute work across multiple cores
divide and conquer: map-reduce (Hadoop is the best-known implementation)
use threads, but have to prepare them
can use snow or Rmpi to farm out jobs
big volume data in R
read "R Data Import/Export" http://cran.r-project.org/doc/manuals/r-release/R-data.html
archive.ics.uci.edu/ml/machine-learning-databases/00216
4.5 Gb size
next time: linear models, extractors, contrasts
2014-10-31
Linear Models
QR decomposition
see notes
log(BodyWt) as response for brain weight data
methods for qr objects
note Q1 and Q2 are orthogonal, which implies directly that resid and fitted are orthogonal
decomposing into orthogonal subspaces
Fisher with poor eyesight spent evenings thinking rather than reading
statistics is not a formula, instead it refers to the model
note: Householder usually yields negative intercept, due to reflection
style: extract pieces from model rather than going back to original data
slides show how to extract RSS, coefficients, etc.
very useful for simulations with same predictors (even if shuffled)
model.frame and model.response: caution on missing data
model.frame removes missing values and applies functions/transformations
model.response only pulls out response for non-missing cases
relation of linear algebra (Math 340) to numerical linear algebra
never evaluate inverses explicitly, except via R^-1 if interested in (X'X)^-1
computational rank--tolerances "close" to zero
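a minimal sketch (my addition, assuming fm1 as above) of getting (X'X)^-1 from the QR's R factor rather than inverting X'X directly:
X <- model.matrix(fm1)
R <- qr.R(qr(X))
XtXinv <- chol2inv(R)  ## (X'X)^-1 computed from R, since X'X = R'R
all.equal(XtXinv, solve(crossprod(X)), check.attributes = FALSE)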
basic linear algebra subroutines
LAPACK benchmark
Revolution R Open (http://www.revolutionanalytics.com/revolution-r-open) has accelerated BLAS using multi-threading
but does not affect LS calculations due to rank deficiencies with factors (more later)
Categorical covariates
degrees of freedom: remove 1 for intercept
W. Edwards Deming: burning the toast and then scraping it--dummy variables
xtable package by David Dahl
```{r nm, results="asis", echo=FALSE}
options(xtable.type="html")
xtable(fm2)
oldclass <- class(fm2)
class(fm2) <- "lm"
xtable(fm2)
class(fm2) <- oldclass
```
first call gives ANOVA table; second call gives coefficient table
Model matrix for factor spray
coef = difference of levels with first level, which is a contrast
orthogonal contrasts, orthogonal polynomials
contr.treatment and contr.poly are most often used
contr.SAS matches SAS
contr.sum(2) for 2-level factorial designs
Box, Hunter and Hunter book: http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471718130.html
details of 2^3 model matrix and design matrix
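a minimal sketch (my addition) comparing the codings:
contr.treatment(3)  ## columns compare levels 2 and 3 with level 1
contr.helmert(3)    ## orthogonal: level 2 vs 1, then level 3 vs mean of 1 and 2
contr.sum(3)        ## sum-to-zero; contr.sum(2) gives the +/-1 coding of 2^k designs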
2014-11-07
questions on homework
facet_wrap()
ch 9: multi-factor experiments
9.1 anova
formula ".~." is old formula, for instance in update()
battery separator data
options(contrasts = c("contr.treatment","contr.helmert"))
summary(fm1 <- aov(y ~ time * temp * silica, separate))
notice 3-way interaction is highly significant
heredity principle: retain all parents of higher order terms
model.matrix(fm1) gives you encoding
contrast choice gives you orthogonal matrix
more stable
each coefficient is independent
followup: looking at subsets (temp == "high")
blocking factors
gl() generate levels
ftable(xtabs()) useful to flatten tables
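a minimal sketch (my addition, hypothetical factors) of gl() and ftable(xtabs()):
d <- data.frame(op   = gl(2, 8,     labels = c("op1", "op2")),
                temp = gl(2, 4, 16, labels = c("lo", "hi")),
                time = gl(2, 2, 16, labels = c("t1", "t2")))
ftable(xtabs(~ op + temp + time, d))  ## flattens the 3-way table of counts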
blocking factor--op--always goes in first
unreplicated multi-factor design
have to give up something, usually highest order interactions
9.2 2^k factorial designs
saturated model if use all interactions--no d.f. for error
QQ-plot: outliers suggest significant effects
normal plot has challenges with sign of signals
half-normal is more stable: distribution is chi = square root of chi-square
should go through the origin
look for points above line through origin
9.3 fractional factorial design
run only part of the experiment (save money and time)
but be careful how you lay it out (plan ahead)
control which main effects and interactions are aliased (completely confounded)
see notes for more details--this is material for Stat 424 for instance
ch 10 regression models
10.1 inference for a regression line
timetemp data: ancova
are the lines different?
are they parallel?
is there a non-negligible slope?
Venables: analysis of variance is really about comparing models, not terms in models
10.2 other regression models
note I(x^2) to create a quadratic (second-order) term
look at Rmd code to see details
confidence intervals confint()
predict() to get prediction values and confidence intervals
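a minimal sketch (my addition; fm and a lone temp predictor are hypothetical):
confint(fm)                                          ## confidence intervals for coefficients
nd <- data.frame(temp = c(150, 175))
predict(fm, newdata = nd, interval = "confidence")   ## CI for the mean response
predict(fm, newdata = nd, interval = "prediction")   ## wider interval for a new observation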
design is according to variables
model.matrix is derived from design
continuous and categorical variates
back to timetemp data
factor coefficient typeOEM is important: repaired vs OEM
change in slope measured by typeOEM:temp
2014-11-14
SimSpeed.Rmd for today
case study with a student's large simulation
good packages to study:
data.table package
dplyr (Hadley Wickham) and magrittr (Stefan Milton Bache) packages
using the alr4 package
why not attach package?
keeps workspace cleaner
use "LazyData" in DESCRIPTION file for package
then only loads what you need
if that is not done, then
data("ufc", package="alr4")
will explicitly load "ufc" without other stuff in package
str(ufc)
library(ggplot2)
## clean up data
qplot(Species, Dbh, data=ufc,geom="boxplot",xlab="Tree Species",
ylab="Diameter at breast height (mm)") + coord_flip()
## reorder in meaningful way
qplot(reorder(Species,Dbh), Dbh, data=ufc,geom="boxplot",xlab="Tree Species",
ylab="Diameter at breast height (mm)") + coord_flip()
## could add "+ scale_y_log10()"
## meaningful scatterplot
qplot(Dbh, Height, data=ufc,geom="point",
xlab="Diameter at breast height (mm)",
ylab="Tree height (dm)") +
scale_y_log10() + scale_x_log10() +
facet_wrap(~ Species)
## note some species are rare--drop them?
## fit model for one species
ufcwc <- within(subset(ufc, Species == "WC"), Species <- NULL)
m1 <- lm(Height ~ 1 + Dbh, data = ufcwc)
options(show.signif.stars = FALSE)
summary(m1)
## be sure to do reality check--look at fit
qplot(Dbh, Height, data=ufcwc) + geom_smooth(method="lm")
qplot(Dbh, Height, data=ufcwc) + geom_smooth()
## poor fit on linear scale suggests log-log fit
summary(m2 <- lm(log(Height) ~ 1 + log(Dbh), data = ufcwc))
## residual plot
plot(m2, which=1)
reduced <- droplevels(subset(ufc, Species %in% c("DF","GF","WC")))
str(reduced)
summary(m3 <- lm(log(Height) ~ 1 + Species + log(Dbh), data = reduced))
anova(m3)
## and also model with interaction (replace "+" by "*")
-------------------------------------------
Simulation Speed handout
## pull out R code
library(knitr)
purl("SimSpeed.Rmd")
## then march through the SimSpeed.R code
system.time(resRepl <- replicate(N, mean(rexp(n))))
## replicate statistic mean() N times with sample size n
qplot(resRepl) + geom_density()
caution on loops: set result vector before you do the loop
don't append
resLoop1 <- numeric(N)
system.time(for(i in 1:N) resLoop1[i] <- mean(rexp(n)))
## don't do
resLoop1 <- numeric()  ## zero length, then grown by appending each iteration
another approach: tradeoff space vs speed
str(MM <- matrix(rexp(n*N), nr = n))
system.time(resColMeans <- colMeans(matrix(rexp(n*N), nr = n)))
very fast, but creates large matrix
another way: meta-programming using apply()
apply function mean() over 2nd index
resApply <- apply(MM, 2, mean)
force garbage collection: gc()
do this before any systematic timing to reduce variability
family of apply meta-functions
2014-11-21
case study of analysis
lessons
use profiling capabilities (based on functions)
things take a long time in areas of code you don't expect
local environments within functions are removed
think (and write) in terms of vectors
be aware of column major ordering: work down columns in nested loops
for (k) for (j) for (i) value[i,j,k]
vast differences in speed for different areas of memory
avoid creating vectors by appending in loops
careful about naming objects: readability, searchability
always set.seed() for simulations
makes strange behavior repeatable
start with 2000 4x4 matrices
generate new set of 4x5 matrices by adding a column
evaluate a probability (need all matrices available)
then shift columns over and drop the last
then iterate this whole process
lists vs arrays
array is homogeneous, all elements of same type, easier to access
list is heterogeneous, need set of (indirect) pointers to different types
chain <- array(NA, c(4,4,2000))
## NB: NA is by default a floating point number
## advantage of NA: find out which values are not set (by sum(is.na(...)))
vectorize runif(n) rather than for(i in 1:n) runif(1)
rfr <- function(n) -1/log(runif(n))
chain <- array(rfr(4*4*2000), c(4,4,2000))
in this case, loop does not cost, but sometimes it does
caution: your guess of where code spends time may be way off
rdirichlet(): scalar returns vector, vector returns matrix
cbind()
both these functions can be quite slow
environments
function call creates an environment
environment removed at exit
gtools package: Dirichlet distribution functions among others
rep(0,n) is slower than numeric(n)
watch out for 1:n when n=0: it gives c(1, 0)
consider seq_len(n)
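a minimal sketch (my addition):
n <- 0
1:n         ## c(1, 0): the loop body would run twice with bogus indices
seq_len(n)  ## integer(0): the loop body never runs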
function advice
pass in numeric arguments
figure out dimensions inside (no need to pass)
create temporary objects inside (will be removed with local env)
vectorize where you can
short code (with documentation) on whole object often easier to figure out
R is a functional language
unit testing: "trust but verify" (Ronald Reagan)
arrays and indexing operator
very flexible
again, work on whole object if you can
compiler package can "byte compile" functions
may get 30% speed increase
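a minimal sketch (my addition):
library(compiler)
f  <- function(x) { s <- 0; for (xi in x) s <- s + xi; s }
fc <- cmpfun(f)  ## byte-compiled version: same results, often noticeably faster in loops
fc(1:10)         ## 55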
functional programming: apply function to an object
LISP structure sitting under R
functional vs mechanistic
divide and conquer
modern statistical ideas concern patterns and processes, which are functions
organizing code by functions can help reveal/communicate methods
Timing
set.seed() ## make repeatable
system.time(myfun(args))
profiling to measure where code bogs down
call stack written out
Rprof() ## turn on
myfun(args)
Rprof(NULL) ## turn off
summaryRprof()
know several ways to use indices into arrays
diag() can take a long time
for [i,j,k] can use an array as index
help("[")
lst[1] vs lst[[1]] for list: what do you get
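a minimal sketch (my addition):
a <- array(1:24, c(2, 3, 4))
a[cbind(1, 2, 3)]  ## a matrix of (i,j,k) rows indexes single elements: a[1,2,3]
lst <- list(x = 1:3, y = "a")
lst[1]    ## a list of length one (still a list)
lst[[1]]  ## the element itself: 1:3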
2014-12-05
http://hilaryparker.com/2014/04/29/writing-an-r-package-from-scratch/
install.packages("devtools")
install.packages("roxygen2")
github and bitbucket as code repositories
library(devtools)
install_github("klutometis/roxygen")
Roxygen
emacs ESS widget
M-x ess-roxy-update-entry creates the header
##' @title name of function
##' @param
##' @return a modified variable
##' @export if function is to be exported
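a hypothetical example (my addition; unitize is made up for illustration):
##' Scale a numeric vector to unit Euclidean length.
##' @title unitize
##' @param x numeric vector
##' @return x divided by its Euclidean norm
##' @export
unitize <- function(x) x / sqrt(sum(x^2))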
create package within RStudio
new project -> new R package -> add source files
x create git repository
x use packrat with this project (snapshot of depends packages)
DESCRIPTION
License: GPL2 usually
GPL3 related to FSF & Apple Xtools
other possible licenses
NAMESPACE
Read-and-delete-me: read it for instructions
now that you have a package
devtools::document()
takes @arguments directly out of code file using roxygen
keeps document information in one file (with code)
updates NAMESPACE and man pages
script in source as example
##' @examples
##' L <- 4
...
NULL
==================
Build tab under upper right RStudio window
Build & Reload: try to build package
Check: check for errors and all sorts of stuff
design philosophy
go from a file to a package to a git repository
all content built into a single file
always write examples
separate area for tests: do functions work together
and produce the results you think they should?
testthat by Hadley Wickham (watch out for approximate equality with floating point)
devtools::test() runs all test_that() tests in a package
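a minimal sketch (my addition), testing the hypothetical unitize() from the roxygen example above (file tests/testthat/test-unitize.R):
library(testthat)
test_that("unitize returns a unit vector", {
  u <- unitize(c(3, 4))
  expect_equal(sqrt(sum(u^2)), 1)  ## expect_equal compares with a floating-point tolerance
})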
=======================
cran.rstudio.com
R Manuals
R Data Import/Export
Writing R Extensions
complete references but intimidating
documents CRAN standards
=========================
R save() datasets (.rda)
extra directories saved in installed package
inst directory
special data for tests
raw source before manipulation
=============
r-pkgs.had.co.nz/intro.html
R packages by Hadley Wickham
nicely written, compatible with Rstudio
see also his book
jekyll: fancy web pages?
=================
philosophy
long file or many short files, say ordered by type of function
simulations all in one place
update from one set of values to another
===================
efficiencies
compile code into byte code
LazyLoad
LazyData
===============
philosophy on proprietary notebooks
need license
v. open source
transfer of "program culture" training across packages
2014-12-12
Special Topics Suggested by Students
publication, tables, graphics
LaTeX: Rmd -> Rnw
Rmd = R + markdown
Rnw = R + LaTeX
beamer
xtable package for tables ready for LaTeX
special LaTeX packages for extra features
graphics
avoid base R graphics
ggplot2 preferred these days
think about design of plots
everyone looks at figures; tables are for obligatory material
parallel processing
examples of reports
exploring R functions
read.csv: type the name (no parentheses) to see the function's source (but comments are stripped)
cd /unsup/R-devel/src/library
cd base
ls -la
you will see man, R, inst, demo folders
cd R
look at individual *.R files
or look in utils for readtable.R
.Internal or .Primitive or .Call go to compiled C code
not easy to look at these (such as exp())
see Uwe Ligges's R News article on how to find these
http://stackoverflow.com/questions/14035506/how-to-see-the-source-code-of-r-internal-or-primitive-function
writing R functions
adv-r.had.co.nz Hadley Wickham
think in terms of the whole object
alternative languages
Julia, Python, Matlab
characteristics of R (mid 1980s design)
interpreted, vectorized, functional semantics
each function evaluation traces through code
don't modify arguments -> do a lot of copying
storage allocation and garbage collection are important
garbage collector
global stop-the-world mark-and-sweep
generational garbage collector
reference counting for localized garbage collecting
W. Edwards Deming: burning the toast and then scraping it
homoiconic
everything is (internally) a vector
there are no scalars
running an R loop is potentially expensive
updating is expensive as well
Matlab/Octave
Matlab: proprietary with many nice GUI features
Octave: open source implementation (bug for bug compatibility)
developed by John Eaton, UW Chem Engr
community has developed but is not flourishing
everything is a matrix; vectorization
R API and Rcpp
API: interface for other languages (C, Fortran)
has to be in R format, etc.
Rcpp: Dirk Eddelbuettel and Romain Francois http://www.rcpp.org
http://cran.r-project.org/web/packages/Rcpp/vignettes/Rcpp-package.pdf
template meta-programming, but complicated (seamless?)
Python: general purpose programming language
used extensively in bioinformatics
scripting language, can be used as a filter between applications
object oriented in C++/Java sense: methods associated with a data type
many libraries / packages
technical computing requires add-ons to python
byte compiler and derivative languages
Julia http://julialang.org from MIT
recently developed (2012)
uses modern CS tools (just-in-time (JIT) compiler)
most successful JIT: javascript and Google's V8 engine
LLVM low level virtual machine project
U Illinois at Urbana-Champaign
open source compiler technology, highly modular
youtube EuroSciPy 2014 Steven Johnson talk
http://www.youtube.com/watch?v=jhlVHoeB05A
wide variety of data types
function based, uses multiple dispatch
dispatch on signature of input types (think R S3 and S4 methods)
multiple dispatch considers several arguments
see Bates presentations under github account
powerful, flexible language where the future lies