These materials are for statisticians at all levels who want to learn more about modern network and computing tools for statistics. The course for Fall 2015 is Stat 627 Professional Skills for Data Science (1-2 cr, Thursdays 1-2:15pm, 1010 MSC). [This is NOT the Stat 327 “Data Analysis with R” course.] It is recommended for all new graduate students. It emphasizes best practices with R, focusing on how to organize your work (using RStudio), how to build excellent graphics (using ggplot2), and how to document your work (using R Markdown), among other things. The course is led by Brian Yandell and Doug Bates in Fall 2015; see the Moodle site, the Stat 627 Syllabus, and Jenny Bryan’s Stat 545 at UBC. The course covers
The raw material for Fall 2014 was at https://github.com/dmbates/stat692 with more material in Yandell’s Stat 692 Notes related to Bates’s talks.
Much of the material below was added for Fall 2013. We used the following network tools to deliver information: Drupal (open-access web pages, including this page); Box (a drop box for collaboration); and Moodle (a course collaboration environment).
We are in the age of big data, when it is not enough to think of what you can do with the computer on your desk or lap. Further, statistics as a field is being transformed by analytics, the process of discovery and communication of meaningful patterns in data. Today, statisticians of all flavors, from the most pragmatic to the most theoretical, need a variety of computational tools and access to vast resources across computer networks. While many of us work in closed shops, behind proprietary walls, the world of open source is core to sharing methods and information. Thus, it is important to understand the power and role of the Linux operating system and of the R language. However, these are only the beginning. While we focus on computational skills, communication is key to the field of statistics, and to science in general. Visualization is at the core of communicating complex ideas, providing a window of insight into the world of data. Today, much communication is online, and we must learn how to leverage online data and network tools to advance in the field and to do our work effectively.
Many of the links above have communication resources.
Visualization is the key to quick insights with data.
Statistics is a department embedded in the UW infrastructure. Much of our system, including stat.wisc.edu email and backup, is coordinated with the Computer Systems Lab (CSL), but the wiring and wireless infrastructure are maintained by the campus. As such, it is important to learn about our system, the CSL system, and the campus systems.
Linux is the “operating system”, the system that organizes work done on many computers, including the main UW Statistics machines. When you type instructions, or commands, on Linux, you do this with a “shell”, which has a language structure worth learning. The default shell on most Linux systems is the Bourne Again shell (bash), written by Brian Fox and named humorously after Steve Bourne, the author of the earlier Unix Bourne shell (sh).
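To give a flavor of that language structure, here is a minimal bash sketch showing a variable, a function, a loop, and a pipe. The file names, their contents, and the helper `count_rows` are made up for illustration; they are not part of the course materials or of any UW Statistics system.

```shell
#!/usr/bin/env bash
# Work in a throwaway directory so nothing in your home directory is touched.
workdir=$(mktemp -d)

# Create two tiny, made-up data files (illustrative only).
printf 'x,y\n1,2\n3,4\n' > "$workdir/a.csv"
printf 'x,y\n5,6\n' > "$workdir/b.csv"

# A function (hypothetical helper): count data rows in one file,
# i.e. every line except the header. The pipe "|" feeds the output
# of one command into the next.
count_rows() {
  tail -n +2 "$1" | wc -l | tr -d ' '
}

# A loop over files, with a variable accumulating the total.
total=0
for f in "$workdir"/*.csv; do
  n=$(count_rows "$f")
  echo "$(basename "$f"): $n rows"
  total=$((total + n))
done
echo "total: $total rows"

# Clean up the throwaway directory.
rm -r "$workdir"
```

Saved as a file, the script can be run with `bash` followed by the file name. The same commands can also be typed one at a time at the prompt, which is a reasonable way to explore what each piece does.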
R is one of many languages and other electronic tools of value to statisticians.
We have a local directory /p/stat/Data with data sets from Devore’s “Engineering Statistics” (Devore, used in Statistics 312), Box, Hunter & Hunter’s “Statistics for Experimenters” (BHH, used in Statistics 424), and Milliken & Johnson’s “Analysis of Messy Data” (MJ); see also Yandell’s “Practical Data Analysis”. The Devore directory has both portable Minitab worksheets and system-specific worksheets. It is also useful to consult StatLib or the Virtual Library: Statistics for the official lists of datasets maintained by the statistics community. In addition, the Internet Scout Toolkit is an excellent source for datasets from many disciplines and organizations.