STATS700-002: Topics in Statistics, Fall 2017

This half-semester topics course will survey the tools and frameworks currently popular in industry and academia for collecting and analyzing data, focusing on the Python programming language. Topics will include dealing with structured data, basic UNIX command line tools, visualizing data using matplotlib, distributed computing using Hadoop and Spark, and building statistical models using TensorFlow.

Instructor: Keith Levin, klevin | at | umich | dot | edu
GSI: Roger Fan, rogerfan | at | umich | dot | edu
Lectures: Mondays and Wednesdays 2:30pm to 4:00pm in Chem 1200
Instructor Office Hours: Tuesdays and Thursdays 2:30pm to 4:00pm in West Hall 313, or by appointment
GSI Office Hours: Tuesdays 10:30am to 11:30am in Chem 1720
Textbook: There is no required textbook. We will make frequent reference to Charles Severance's textbook. See below for weekly readings.
Syllabus: Available here
Prerequisites: There are no formal prerequisites for this course. Previous exposure to a programming language, preferably Python, is highly recommended.
Homework submission instructions: Instructions for submitting homeworks as well as a homework template can be found here.

Course Schedule

Date | Topics | Readings | Notes
Monday, Oct 23 | Course introduction; Administrivia; Regular Expressions | Severance Chapter 11: Regular expressions (required); Python regex documentation (recommended); Jupyter documentation (recommended) | HW1 out; Request a Flux/Fladoop username, if necessary; Slides
Wednesday, Oct 25 | Markup languages; HTML, XML, JSON | Severance Chapter 12 (HTTP, HTML) and Chapter 13 (XML, JSON) (required); BeautifulSoup documentation (just Quick Start) (required); BeautifulSoup documentation (everything up to sections about CSS) (recommended); BeautifulSoup4 tutorial (recommended) | Slides
Monday, Oct 30 | Markup languages (continued) | Same as previous lecture. | Slides
Wednesday, Nov 1 | Databases; SQL | Oracle relational databases overview (only the overview!) (required); First section of Python sqlite3 documentation or Python 3 version (required); w3schools SQL tutorial (recommended) | Slides
Monday, Nov 6 | Data visualization with matplotlib | Numpy quickstart tutorial (required); Pyplot tutorial (required); Pyplot API (recommended); The Visual Display of Quantitative Information by Edward Tufte (recommended); Visual and Statistical Thinking: Displays of Evidence for Making Decisions by Edward Tufte (recommended) | HW2 out; Slides
Wednesday, Nov 8 | Introduction to the Command Line | Introduction to UNIX Commands (required); Survival guide for UNIX newbies (recommended); GNU/Linux Command-Line Tools Summary (recommended) | Slides
Monday, Nov 13 | Introduction to Hadoop and MapReduce | J. Dean and S. Ghemawat MapReduce: Simplified Data Processing on Large Clusters in Proceedings of the Sixth Symposium on Operating System Design and Implementation, 2004 (required); Introduction to HDFS by J. Hanson (recommended) | HW3 out; Slides
Wednesday, Nov 15 | MapReduce in Python: mrjob | mrjob Fundamentals and Concepts (required); Hadoop wiki: How MapReduce operations are actually carried out (required); Allen Downey’s Think Python Chapter 15 on Objects (pages 143-149, recommended); Classes and objects in Python (recommended) | HW1 due; Slides; Demo code
Monday, Nov 20 | mrjob cont'd; MapReduce in Python: Spark | Spark programming guide (required); PySpark programming guide (required); Spark MLlib, a Spark machine learning library (recommended); Spark GraphX, a Spark library for processing graph data (recommended) | Slides
Wednesday, Nov 22 | PySpark cont'd | Same as previous lecture. | HW2 due; Slides
Monday, Nov 27 | Google TensorFlow | TensorFlow tutorial: Getting Started with TensorFlow (required); Abadi, et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems (required); Assorted tutorials on statistical and neural models in TensorFlow (recommended) | Slides
Wednesday, Nov 29 | TensorFlow cont'd | TF tutorial on recognizing MNIST digits using softmax regression (required); advanced tutorial on training a feedforward NN for MNIST (recommended) |
Monday, Dec 4 | TensorFlow cont'd | Chapter 6 of Deep Learning by Goodfellow, Bengio and Courville (recommended) | HW4 out; Slides; Softmax regression demo; Multilayer CNN demo
Wednesday, Dec 6 | TensorFlow cont'd | Chapter 6 of Deep Learning by Goodfellow, Bengio and Courville (recommended) | HW3 due
Monday, Dec 11 | Advanced UNIX | J. Janssens Data Science at the Command Line (recommended); S. Das (2005, 2012). Your UNIX: the Ultimate Guide. McGraw-Hill. (recommended); Sed manual (recommended); GNU awk user’s guide (recommended) | Slides