University of Wisconsin-Madison
Statistics 679: Data Science Computing Project

Check for updates to this tentative syllabus.

Course Description and Goals

Use Linux, R, and distributed computing to analyze data sets too large for a laptop:

  1. Collect and manage data, and write programs and documentation, using tools suited to large computations.
  2. Run analyses too large for a laptop.
  3. Work in teams to research, develop, write, and make three presentations.

Name            Office Hours                   Phone     Email (please use our Q&A forum for most things)
Gillett, John   Medical Sciences Center 1590   890-3216
Huang, Kunling  Medical Sciences Center 1335B

Class Times
Lecture 679-002   MoWe 11:00-12:15   Service Memorial Institute 133


No textbook is required.

Optional Reference Books
The Linux Command Line by William E. Shotts Jr. (free online or for sale)

A laptop with the free program VirtualBox installed is required in class.

Many questions outside of class should be posted at our Q&A forum. Please feel free to write answers when you know them. We are eager to help in class and office hours too.

(I might revise this to improve the course.)

This 3-credit face-to-face course meets twice a week for 75 minutes per class and carries the expectation that students will work on the course for about 3 hours out of class for each 75-minute class.

Grades are posted online. These points are available:
  in-class group work on statistical software  20
  individual statistical software exercises  40
  group data science project including presentations  40


If you anticipate religious or other conflicts with course requirements, or if you require accommodation due to disability, you must notify me during the first three weeks of class. You may not make up missed quizzes, homework, or exams, except in the rare case of a documented, serious problem beyond your control.

I encourage you to discuss the course with others, but you must write programs and quizzes by yourself and prevent others from copying your work. (See the UW Academic Integrity policy.)

Tentative Schedule
Week #: Date, Subject (homework due at 11:59 p.m. is marked "Due:")
1: 9/5/18 Run a Linux virtual machine on your laptop (and debug problems)
  Piazza Q&A forum demo
  read email
2: 9/10,12 Emacs text editor: reference sheet, demo1 (data.txt, tiny.R, sifting.txt)
  Emacs: demo2, regular expressions
3: 9/17,19 Emacs, continued; HW1notes.txt
  Lyman-break galaxies
  Due: HW1 9/20: Emacs
4: 9/24,26 (CHTC accounts: phone number)
  Lyman-break galaxies, continued
  ideas for HW2
5: 10/1,3 git/GitHub version control system (tiny code example)
  Lyman-break galaxies, continued
  Due: HW2 10/4: galaxies on laptop
6: 10/8,10 Linux and bash shell scripting
7: 10/15,17 shell scripting, continued
  Group work 1: scripting exercises
  Due: Group1 10/18: shell scripting
8: 10/22,24 Slurm and UW-Statistics High Performance Computing Cluster (HPC)
9: 10/29,31 Christina Koch: distributed computing via HTCondor at CHTC (slides, handout, manual)
  tiny CHTC examples
  Due: HW3 11/1: airlines on Slurm
10: 11/5,7 CHTC parallel sd example, sd.tar (run "condor_submit_dag sd.dag")
  Group work 2: parallel word counting
11: 11/12,14 Group work 2: CHTC, continued
  Using R at CHTC (callingR.tar)
  start project proposals (data ideas)
  Due: Group2 11/16: CHTC
12: 11/19,21 Group work 3(a): project proposal development
  project proposal presentations
  [11/22 Thanksgiving]
  Due: Group3(a) 11/20: project proposal
13: 11/26,28 HW4 help
  project help
  Due: HW4 11/30 (extended to 12/3): galaxies on CHTC
14: 12/3,5 project help
  Group work 3(b): draft presentations
  Due: (Group3(b) 12/4: draft presentation)
15: 12/10,12 choose order of revised presentations; suggestions; peer feedback/example
  project help
  unofficial course feedback
  Group work 3(c): revised project presentations
  Due: (Group3(c) 12/11: revised presentation)
  Due: Group3(d) 12/12: project report