STATS507: Data Analysis in Python, Winter 2019

This course will survey the tools and frameworks currently popular in industry and academia for collecting and analyzing data, focusing on the Python programming language. Topics will include an introduction to Python, data visualization using matplotlib, dealing with structured data, basic UNIX command line tools, distributed computing using Hadoop and Spark, and building statistical models using TensorFlow.

Instructor: Keith Levin, klevin | at | umich | dot | edu
GSI: Roger Fan, rogerfan | at | umich | dot | edu
Lectures: Wednesdays and Fridays 8:30am to 10:00am in DANA1040
Instructor Office Hours: Wednesdays and Fridays, 10:00am to 11:30am in West Hall 438, or by appointment
GSI Office Hours: Mondays, 1:00pm to 4:00pm in USB 2165
Textbook: There is no required textbook. We will make frequent reference early in the course to Allen B. Downey's textbook Think Python and Charles Severance's Python for Everybody (previously Python for Informatics). See below for weekly readings.
Syllabus: Available here
Prerequisites: There are no formal prerequisites for this course. Previous exposure to a programming language, preferably Python, is recommended.
Homework submission instructions: Instructions for submitting homework, as well as a homework template, can be found here.

Course Schedule

Date | Topics | Readings | Notes
Wednesday, Jan 9 | Course introduction; Administrivia; Intro to Python: data types and function definitions | Jupyter notebook documentation (required); either A. B. Downey, Chapters 1 through 3 or Severance, Chapters 1, 2 and 4 (required) | HW1 out; Slides; Notebook
Friday, Jan 11 | Intro to Python: conditionals, iteration and recursion | Either A. B. Downey, Chapters 5, 6 and 7 or Severance, Chapters 4 and 5 (required); Python documentation on compound statements (recommended) | Slides; Notebook
Wednesday, Jan 16 | Intro to Python: Strings and Lists | Either A. B. Downey, Chapters 8 and 10 or Severance, Chapters 6 and 8 (required); A. B. Downey, Chapter 9 (recommended); Python documentation on lists (recommended); Python documentation on sequences (recommended) | HW2 out; Slides; Notebook
Friday, Jan 18 | Intro to Python: Dictionaries and Tuples | Either A. B. Downey, Chapters 11 and 12 or Severance, Chapters 9 and 10 (required); Python documentation on dictionaries (recommended); Python documentation on tuples (recommended); Python documentation on sets (recommended); A. B. Downey, Section B.4 (recommended); A. B. Downey, Chapter 13 (recommended) | No office hours. Slides; Notebook
Wednesday, Jan 23 | File I/O and Objects | A. B. Downey, Chapter 14 or Severance, Chapter 7 (required); Python File I/O Documentation (required); Handling Errors and Exceptions (required); Python pickle module (recommended); A. B. Downey, Chapters 15 and 16 (required); Python documentation on classes (only through section 9.3) (required); D. Phillips (2015). Python 3 Object-oriented Programming, Second Edition. Packt Publishing. (recommended); M. Weisfeld (2009). The Object-Oriented Thought Process, Third Edition. Addison-Wesley. (recommended) | HW3 out; Slides; Notebook
Friday, Jan 25 | File I/O and Objects, cont'd | A. B. Downey, Chapter 14 or Severance, Chapter 7 (required); Python File I/O Documentation (required); Handling Errors and Exceptions (required); Python pickle module (recommended); A. B. Downey, Chapters 15 and 16 (required); Python documentation on classes (only through section 9.3) (required); D. Phillips (2015). Python 3 Object-oriented Programming, Second Edition. Packt Publishing. (recommended); M. Weisfeld (2009). The Object-Oriented Thought Process, Third Edition. Addison-Wesley. (recommended) |
Wednesday, Jan 30 | Classes canceled due to extreme cold weather. | |
Friday, Feb 1 | Functional programming: itertools and functools | Python itertools documentation (required); Python functools documentation (required); A. M. Kuchling. Functional Programming HOWTO (required); M. R. Cook. A Practical Introduction to Functional Programming (recommended); D. Mertz. Functional Programming in Python (recommended) | HW4 out; Slides; Notebook
Wednesday, Feb 6 | Functional programming, cont'd | |
Friday, Feb 8 | numpy, SciPy and matplotlib | Numpy quickstart tutorial (required); Pyplot tutorial (required); SciPy tutorial (recommended); Pyplot API (recommended); E. Tufte (2001). The Visual Display of Quantitative Information. Graphics Press. (recommended); E. Tufte (1997). Visual and Statistical Thinking: Displays of Evidence for Making Decisions. Graphics Press. (recommended) | HW5 out; Slides; Notebook
Wednesday, Feb 13 | Python pandas | pandas quickstart guide (required); Basic data structures (required); Basic functionality of pandas Series and DataFrames (required); pandas cookbook (recommended) | HW6 out; Slides; Notebook; Baseball dataset
Friday, Feb 15 | Python pandas, cont'd | pandas group-by operations (required); Reshaping and pivoting (required); Merge, join and concatenation (recommended); Time series functionality (recommended) | Slides; Notebook
Wednesday, Feb 20 | Regular expressions | Severance Chapter 11: Regular expressions (required); Python regex documentation (recommended) | HW7 out; Slides; Notebook
Friday, Feb 22 | Markup languages; HTML, XML, JSON | Severance Chapter 12 (HTTP, HTML) and Chapter 13 (XML, JSON) (required); BeautifulSoup documentation (just Quick Start) (required); BeautifulSoup documentation (everything up to sections about CSS) (recommended); BeautifulSoup4 tutorial (recommended) | Slides; Notebook
Wednesday, Feb 27 | Interacting with Databases: SQL | Oracle relational databases overview (and only the overview!) (required); First section of Python sqlite3 documentation (required); w3schools SQL tutorial (recommended) | Slides; Notebook
Friday, March 1 | UNIX/Linux command line | Introduction to UNIX Commands (required); Survival guide for UNIX newbies (recommended); GNU/Linux Command-Line Tools Summary (recommended) | Slides
Wednesday, March 6 | Winter break. No lecture. No office hours. | |
Friday, March 8 | Winter break. No lecture. No office hours. | |
Wednesday, March 13 | Introduction to Hadoop and MapReduce | J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the Sixth Symposium on Operating System Design and Implementation, 2004 (required); Introduction to HDFS by J. Hanson (recommended) | Slides
Friday, March 15 | MapReduce in Python: mrjob | mrjob Fundamentals and Concepts (required); Hadoop wiki: How MapReduce operations are actually carried out (required) | HW8 out; Slides; mrjob demo code
Wednesday, March 20 | MapReduce in Python: mrjob cont'd | |
Friday, March 22 | MapReduce in Python: PySpark | Spark programming guide (required); PySpark programming guide (required); Spark MLlib, a Spark machine learning library (recommended); Spark GraphX, a Spark library for processing graph data (recommended) | Slides
Wednesday, March 27 | PySpark cont'd; Overview of text editors | nano overview (recommended); vim documentation (recommended); emacs documentation (recommended) | wordcount example
Friday, March 29 | Command line: part 2 | Data Science at the Command Line by J. Janssens (recommended); S. Das (2005, 2012). Your UNIX: the Ultimate Guide. McGraw-Hill. (recommended); Sed manual (recommended); GNU awk user’s guide (recommended) | Slides
Wednesday, April 3 | Algorithms, Profiling and Debugging | A. B. Downey, Appendix B (required); Python cProfile/Profile documentation (recommended); Python unittest documentation (recommended) | HW9 out; Slides; Notebook; Demo files
Friday, April 5 | scikit-learn | sklearn quickstart tutorial (required); sklearn user-guide (recommended) | Slides; Notebook
Wednesday, April 10 | Google TensorFlow | TensorFlow tutorial: Getting Started with TensorFlow (required); Abadi, et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems (required); Assorted tutorials on statistical and neural models in TensorFlow (recommended) | HW10 out; Slides; Notebook
Friday, April 12 | TensorFlow cont'd | Chapter 6 of Deep Learning by Goodfellow, Bengio and Courville (recommended) | Slides; Softmax regression demo; Multilayer CNN demo
Wednesday, April 17 | TensorFlow cont'd | |
Friday, April 19 | TensorFlow cont'd; APIs | Getting started with the Python requests package (recommended); Mozilla overview of HTTP methods (recommended); RFC Specifying HTTP methods (recommended) | Slides; Notebook