Description: This material aims to give teams in the data sciences an understanding of, and hands-on experience with, professional skills in data science. Researchers today must organize data projects so that they can repeat tasks and share data, ideas, reports and code with others in diverse teams. They need to do this quickly in real time and over the longer term, reproducing tasks, whether their own or those of others, months or years later. This involves building documents as a project evolves to capture workflow, and sharing data, methods and results with team collaborators. To do this well, researchers as data scientists need to be skilled with internet tools, sophisticated use of statistical languages (such as R) and other emerging topics.
Learning Objectives: After completing this material, an individual will be able to
- use R and RStudio as a platform for statistical computing
    - install R and RStudio on a personal computer
    - navigate the RStudio integrated development environment (IDE)
    - use help pages and internet searches to answer questions
- curate data in R (see the tidyverse sketch after this list), including
    - read, manipulate and display data summaries in concise tables
    - work with data frames using tidyverse tools
    - create functions to collapse repeated steps into one-line “verbs”
    - write a cleaned-up data table out in CSV format
    - visualize data with plots
- organize data, methods and documentation
    - document ongoing work with R Markdown
    - use git and GitHub to keep track of code and document changes with version control
    - organize functions, documentation and data into packages (R libraries) to share
    - create and manage external databases from R objects
- analyze data with statistical models (see the modeling sketch after this list)
    - correlate measurements and compare across groups using linear models
    - develop annotated plots that reflect project design
    - organize linear model results with the broom, car, or lsmeans packages
    - apply clustering and tree methods, and build networks (see the clustering sketch after this list)
    - link observations in connected graphs
- profile code for efficiency and error checking
    - profile code to identify bottlenecks and logic errors
    - simulate data to study statistical properties (see the simulation sketch after this list)
    - create graphics to diagnose patterns in raw and derived data
- connect with other data science tools beyond R
    - use the unix/linux shell to search and modify a project
    - build a basic pipeline or workflow in the shell
    - use high throughput computing
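The sketches below preview a few of these objectives in code. First, a minimal tidyverse curation sketch: read a table, collapse a repeated summary into a one-line “verb”, write the cleaned table back out as CSV, and plot. The file `measurements.csv` and its columns `group` and `value` are hypothetical placeholders, not part of the course material.

```r
library(dplyr)
library(readr)
library(ggplot2)

# Read raw data; `measurements.csv` with columns `group` and `value`
# is a hypothetical example file.
raw <- read_csv("measurements.csv")

# Collapse a repeated summary step into a one-line "verb".
summarize_by_group <- function(data, group_var, value_var) {
  data %>%
    group_by({{ group_var }}) %>%
    summarize(
      n    = n(),
      mean = mean({{ value_var }}, na.rm = TRUE),
      sd   = sd({{ value_var }}, na.rm = TRUE),
      .groups = "drop"
    )
}

cleaned <- summarize_by_group(raw, group, value)
cleaned                                   # concise summary table

# Write the cleaned-up table out in CSV format.
write_csv(cleaned, "measurements_summary.csv")

# Quick visual check of the raw data by group.
ggplot(raw, aes(x = group, y = value)) +
  geom_boxplot()
```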
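For the modeling objectives, a hedged sketch of fitting a linear model and organizing the results with broom, using the built-in `mtcars` data so it runs as-is; the particular model is illustrative, not prescribed by the material.

```r
library(broom)

# Fit a linear model on built-in data:
# mpg explained by weight and number of cylinders.
fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)

tidy(fit)     # coefficient estimates, standard errors, p-values as a tibble
glance(fit)   # one-row model summary (R-squared, AIC, ...)
augment(fit)  # fitted values and residuals attached to the data

# Annotated plot reflecting the grouping used in the model.
plot(mtcars$wt, mtcars$mpg,
     col  = factor(mtcars$cyl), pch = 19,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "mpg vs wt, colored by cylinder count")
```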
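A small clustering sketch on the built-in `USArrests` data illustrates the clustering-and-trees objective with base R tools; it is one possible approach, not the course's prescribed one (graph and network construction would typically use an additional package such as igraph, not shown here).

```r
# Hierarchical clustering of the built-in USArrests data.
d  <- dist(scale(USArrests))          # Euclidean distances on standardized variables
hc <- hclust(d, method = "ward.D2")   # Ward clustering
plot(hc, cex = 0.6)                   # dendrogram ("tree") of the states
clusters <- cutree(hc, k = 4)         # cut the tree into 4 groups
table(clusters)
```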
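Finally, a simulation sketch with a first pass at timing: estimate the coverage of a t-based 95% confidence interval and measure how long the run takes. The sample size and the 5000-replicate count are illustrative assumptions.

```r
# Simulate data to study a statistical property: coverage of the
# t-based 95% confidence interval for a normal mean.
set.seed(123)

one_rep <- function(n = 30, mu = 0) {
  x  <- rnorm(n, mean = mu)
  ci <- t.test(x)$conf.int
  ci[1] <= mu && mu <= ci[2]          # TRUE if the interval covers mu
}

# Time the whole simulation as a coarse profile; Rprof()/summaryRprof()
# or the profvis package give finer, line-level detail.
system.time(covered <- replicate(5000, one_rep()))

mean(covered)                          # should be near the nominal 0.95
```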