Description: This material aims to give teams in the data sciences an understanding of, and hands-on experience with, professional skills in data science. Researchers today must organize data projects so that they can repeat tasks and share data, ideas, reports and code with others in diverse teams. They need to do this quickly in real time and over the longer term, reproducing tasks, whether their own or those of others, months or years later. This involves building documents as a project evolves to capture workflow, and sharing data, methods and results with team collaborators. To do this well, researchers as data scientists need to be skilled with internet tools, sophisticated use of statistical languages (such as R) and other emerging topics.
Learning Objectives: After completing this material, an individual will be able to
- use R and RStudio as a platform for statistical computing
    - install R and RStudio on a personal computer
    - navigate the RStudio integrated development environment (IDE)
    - use help pages and internet searches to answer questions
- curate data in R (see the tidyverse sketch after this list), including
    - read, manipulate and display data summaries in concise tables
    - work with data frames using tidyverse tools
    - create functions to collapse repeated steps into one-line “verbs”
    - write a cleaned-up data table out in CSV format
    - visualize data with plots
- organize data, methods and documentation
    - document ongoing work with R Markdown
    - use git and GitHub to keep track of code and document changes with version control
    - organize functions, documentation and data into packages (R libraries) to share
    - create and manage external databases from R objects
- analyze data with statistical models (see the modeling sketch after this list)
    - correlate measurements and compare across groups using linear models
    - develop annotated plots that reflect project design
    - organize linear model results with the broom, car, or lsmeans packages
    - apply clustering and tree methods, and build networks (see the clustering sketch after this list)
    - link observations in connected graphs
- profile code for efficiency and error checking
    - profile code to identify bottlenecks and logic errors
    - simulate data to study statistical properties (see the simulation sketch after this list)
    - create graphics to diagnose patterns in raw and derived data
- connect with other data science tools beyond R
    - use the unix/linux shell to search and modify a project
    - build a basic pipeline or workflow in the shell
    - use high throughput computing
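The sketches below preview a few of these objectives in code. First, a minimal tidyverse curation sketch: read a table, collapse a repeated summary into a one-line “verb”, write the cleaned table back out as CSV, and plot. The file `measurements.csv` and its columns `group` and `value` are hypothetical placeholders, not part of the course material.

```r
library(dplyr)
library(readr)
library(ggplot2)

# Read raw data; `measurements.csv` with columns `group` and `value`
# is a hypothetical example file.
raw <- read_csv("measurements.csv")

# Collapse a repeated summary step into a one-line "verb".
summarize_by_group <- function(data, group_var, value_var) {
  data %>%
    group_by({{ group_var }}) %>%
    summarize(
      n    = n(),
      mean = mean({{ value_var }}, na.rm = TRUE),
      sd   = sd({{ value_var }}, na.rm = TRUE),
      .groups = "drop"
    )
}

cleaned <- summarize_by_group(raw, group, value)
cleaned                                   # concise summary table

# Write the cleaned-up table out in CSV format.
write_csv(cleaned, "measurements_summary.csv")

# Quick visual check of the raw data by group.
ggplot(raw, aes(x = group, y = value)) +
  geom_boxplot()
```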
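For the modeling objectives, a hedged sketch of fitting a linear model and organizing the results with broom, using the built-in `mtcars` data so it runs as-is; the particular model is illustrative, not prescribed by the material.

```r
library(broom)

# Fit a linear model on built-in data:
# mpg explained by weight and number of cylinders.
fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)

tidy(fit)     # coefficient estimates, standard errors, p-values as a tibble
glance(fit)   # one-row model summary (R-squared, AIC, ...)
augment(fit)  # fitted values and residuals attached to the data

# Annotated plot reflecting the grouping used in the model.
plot(mtcars$wt, mtcars$mpg,
     col  = factor(mtcars$cyl), pch = 19,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "mpg vs wt, colored by cylinder count")
```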
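A small clustering sketch on the built-in `USArrests` data illustrates the clustering-and-trees objective with base R tools; it is one possible approach, not the course's prescribed one (graph and network construction would typically use an additional package such as igraph, not shown here).

```r
# Hierarchical clustering of the built-in USArrests data.
d  <- dist(scale(USArrests))          # Euclidean distances on standardized variables
hc <- hclust(d, method = "ward.D2")   # Ward clustering
plot(hc, cex = 0.6)                   # dendrogram ("tree") of the states
clusters <- cutree(hc, k = 4)         # cut the tree into 4 groups
table(clusters)
```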
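Finally, a simulation sketch with a first pass at timing: estimate the coverage of a t-based 95% confidence interval and measure how long the run takes. The sample size and the 5000-replicate count are illustrative assumptions.

```r
# Simulate data to study a statistical property: coverage of the
# t-based 95% confidence interval for a normal mean.
set.seed(123)

one_rep <- function(n = 30, mu = 0) {
  x  <- rnorm(n, mean = mu)
  ci <- t.test(x)$conf.int
  ci[1] <= mu && mu <= ci[2]          # TRUE if the interval covers mu
}

# Time the whole simulation as a coarse profile; Rprof()/summaryRprof()
# or the profvis package give finer, line-level detail.
system.time(covered <- replicate(5000, one_rep()))

mean(covered)                          # should be near the nominal 0.95
```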