STATS701: Data Analysis in Python, Winter 2018

This topics course will survey the tools and frameworks currently popular in industry and academia for collecting and analyzing data, focusing on the Python programming language. Topics will include an introduction to Python, data visualization using matplotlib, dealing with structured data, basic UNIX command line tools, distributed computing using Hadoop and Spark, and building statistical models using TensorFlow.

Instructor: Keith Levin, klevin | at | umich | dot | edu
GSI: Roger Fan, rogerfan | at | umich | dot | edu
Lectures: Mondays and Wednesdays 11:30am to 1:00pm in BLAU 1580
Instructor Office Hours: Tuesdays and Thursdays, 11:30am to 1:00pm in WH313, or by appointment
GSI Office Hours: Wednesdays, 3pm to 6pm in USB 2165
Textbook: There is no required textbook. We will make frequent reference early in the course to Allen B. Downey's textbook Think Python and Charles Severance's Python for Everybody (previously Python for Informatics). See below for weekly readings.
Syllabus: Available here
Prerequisites: There are no formal prerequisites for this course. Previous exposure to a programming language, preferably Python, is recommended.
Homework submission instructions: Instructions for submitting homeworks as well as a homework template can be found here.

Course Schedule

Date | Topics | Readings | Notes
Wednesday, Jan 3 | Course introduction; Administrivia; Intro to Python: data types and function definitions | Jupyter notebook documentation (required); either A. B. Downey, Chapters 1 through 3 or Severance, Chapters 1, 2 and 4 (required) | HW1 out; Slides
Monday, Jan 8 | Intro to Python: function definitions cont'd; conditionals, iteration and recursion | Either A. B. Downey, Chapters 5, 6 and 7 or Severance, Chapters 4 and 5 (required); Python documentation on compound statements (recommended) | Slides
Wednesday, Jan 10 | Strings, Lists and Sequences | Either A. B. Downey, Chapters 8 and 10 or Severance, Chapters 6 and 8 (required); A. B. Downey, Chapter 9 (recommended); Python documentation on lists (recommended); Python documentation on sequences (recommended) | HW2 out; Slides
Monday, Jan 15 | MLK Day. No Lecture. | M. L. King, Jr. Where Do We Go from Here: Chaos or Community? (recommended) |
Wednesday, Jan 17 | Dictionaries | Either A. B. Downey, Chapter 11 or Severance, Chapter 9 (required); Python documentation on dictionaries (recommended); A. B. Downey, Section B.4 (recommended) | HW1 due; Slides
Monday, Jan 22 | Dictionaries (cont'd); Tuples and Sets | Either A. B. Downey, Chapter 12 or Severance, Chapter 10 (required); Python documentation on tuples (recommended); Python documentation on sets (recommended); A. B. Downey, Chapter 13 (recommended) | Slides
Wednesday, Jan 24 | File I/O | A. B. Downey, Chapter 14 or Severance, Chapter 7 (required); Python File I/O Documentation (required); Handling Errors and Exceptions (required); Python pickle module (recommended) | HW3 out; Slides
Monday, Jan 29 | Object-oriented programming: Classes | A. B. Downey, Chapters 15 and 16 (required); Python documentation on classes (only through section 9.3) (required); D. Phillips (2015). Python 3 Object-oriented Programming, Second Edition. Packt Publishing. (recommended); M. Weisfeld (2009). The Object-Oriented Thought Process, Third Edition. Addison-Wesley. (recommended) | HW2 due; Slides
Wednesday, Jan 31 | Object-oriented programming: Operators and Inheritance | A. B. Downey, Chapters 17 and 18 (required); Python documentation on operators (recommended); Python coding style guide (recommended); Google Python style guide (recommended) | HW4 out; Slides
Monday, Feb 5 | Objects and inheritance, cont'd | |
Wednesday, Feb 7 | Functional programming I: itertools | Python itertools documentation (required); A. M. Kuchling. Functional Programming HOWTO (required); M. R. Cook. A Practical Introduction to Functional Programming (recommended); D. Mertz. Functional Programming in Python (recommended) | HW5 out; HW3 due; Slides
Monday, Feb 12 | Functional programming II: functools | Python functools documentation (required); A. M. Kuchling. Functional Programming HOWTO (required); M. R. Cook. A Practical Introduction to Functional Programming (recommended); D. Mertz. Functional Programming in Python (recommended) | Slides
Wednesday, Feb 14 | Numpy, SciPy and matplotlib | Numpy quickstart tutorial (required); Pyplot tutorial (required); SciPy tutorial (recommended); Pyplot API (recommended); E. Tufte (2001). The Visual Display of Quantitative Information. Graphics Press. (recommended); E. Tufte (1997). Visual and Statistical Thinking: Displays of Evidence for Making Decisions. Graphics Press. (recommended) | HW6 out; HW4 due; Slides
Monday, Feb 19 | Python pandas | pandas quickstart guide (required); Basic data structures (required); Basic functionality of pandas Series and DataFrames (required); pandas cookbook (recommended) | Slides
Wednesday, Feb 21 | Python pandas, cont'd | pandas group-by operations (required); Reshaping and pivoting (required); Merge, join and concatenation (recommended); Time series functionality (recommended) | HW7 out; HW5 due; Slides
Monday, Feb 26 | Winter break. No lecture. No office hours this week. | |
Wednesday, Feb 28 | Winter break. No lecture. No office hours this week. | |
Monday, Mar 5 | Regular expressions | Severance Chapter 11: Regular expressions (required); Python regex documentation (recommended) | Slides
Wednesday, Mar 7 | Markup languages; HTML, XML, JSON | Severance Chapter 12 (HTTP, HTML) and Chapter 13 (XML, JSON) (required); BeautifulSoup documentation (just Quick Start) (required); BeautifulSoup documentation (everything up to sections about CSS) (recommended); BeautifulSoup4 tutorial (recommended) | HW6 due; Slides
Monday, Mar 12 | Interacting with Databases: SQL | Oracle relational databases overview (and only the overview!) (required); First section of Python sqlite3 documentation (required); w3schools SQL tutorial (recommended) | HW8 out; Slides
Wednesday, Mar 14 | Introduction to the UNIX/Linux command line | Introduction to UNIX Commands (required); Survival guide for UNIX newbies (recommended); GNU/Linux Command-Line Tools Summary (recommended) | HW7 due; Slides
Monday, Mar 19 | Introduction to Hadoop and MapReduce | J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters in Proceedings of the Sixth Symposium on Operating System Design and Implementation, 2004 (required); Introduction to HDFS by J. Hanson (recommended) | Slides
Wednesday, Mar 21 | MapReduce in Python: mrjob | mrjob Fundamentals and Concepts (required); Hadoop wiki: How MapReduce operations are actually carried out (required) | HW9 out; Slides; Demo code
Monday, Mar 26 | MapReduce in Python: PySpark | Spark programming guide (required); PySpark programming guide (required); Spark MLlib, a Spark machine learning library (recommended); Spark GraphX, a Spark library for processing graph data (recommended) | Slides
Wednesday, Mar 28 | PySpark cont'd | | HW8 due
Monday, April 2 | Google TensorFlow | TensorFlow tutorial: Getting Started with TensorFlow (required); Abadi, et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems (required); Assorted tutorials on statistical and neural models in TensorFlow (recommended) | Slides
Wednesday, April 4 | TensorFlow cont'd | Chapter 6 of Deep Learning by Goodfellow, Bengio and Courville (recommended) | HW10 out
Monday, April 9 | TensorFlow cont'd | Chapter 6 of Deep Learning by Goodfellow, Bengio and Courville (recommended) | Slides; Softmax regression demo; Multilayer CNN demo
Wednesday, April 11 | Advanced UNIX | Data Science at the Command Line by J. Janssens (recommended); S. Das (2005, 2012). Your UNIX: the Ultimate Guide. McGraw-Hill. (recommended); Sed manual (recommended); GNU awk user's guide (recommended) | HW9 due; Slides
Monday, April 16 | Graph Processing and Interacting with APIs | Getting started with the Python requests package (recommended); Mozilla overview of HTTP methods (recommended); RFC Specifying HTTP methods (recommended); igraph tutorial (recommended) | Slides