STATS507: Data Science in Python, Fall 2019

This course surveys the tools and frameworks currently popular in industry and academia for collecting and analyzing data, with a focus on the Python programming language. Topics include an introduction to Python, data visualization using matplotlib, working with structured data, basic UNIX command line tools, distributed computing using Hadoop and Spark, and building statistical models using TensorFlow.

Instructor: Keith Levin, kdlevin | at | umich | dot | edu
GSIs: Roger Fan, rogerfan | at | umich | dot | edu; Su I Iao, iaosui | at | umich | dot | edu
Lectures: Tuesdays and Thursdays 4:00pm to 5:30pm in G390 DENT
Instructor Office Hours: Wednesdays, 1:30pm to 3:00pm in 313 WH, or by appointment
GSI Office Hours: Mondays and Tuesdays 2:00pm to 3:30pm in 2165 USB
Textbook: There is no required textbook. We will make frequent reference early in the course to Allen B. Downey's textbook Think Python and Charles Severance's Python for Everybody (previously Python for Informatics). See below for weekly readings.
Syllabus: Available here
Prerequisites: There are no formal prerequisites for this course. Previous exposure to a programming language, preferably Python, is recommended.
Homework submission instructions: Instructions for submitting homeworks as well as a homework template can be found here.

Course Schedule

Date | Topics | Readings | Notes
Tuesday, Sep 3 | Course introduction; Administrivia; Intro to Python: data types and function definitions | Jupyter notebook documentation (required); either A. B. Downey, Chapters 1 through 3 or Severance, Chapters 1, 2 and 4 (required) | HW1 out; Slides; Notebook
Thursday, Sep 5 | Intro to Python: conditionals, iteration and recursion | Either A. B. Downey, Chapters 5, 6 and 7 or Severance, Chapters 4 and 5 (required); Python documentation on compound statements (recommended) | Slides; Notebook
Tuesday, Sep 10 | Intro to Python: Strings and Lists | Either A. B. Downey, Chapters 8 and 10 or Severance, Chapters 6 and 8 (required); A. B. Downey, Chapter 9 (recommended); Python documentation on lists (recommended); Python documentation on sequences (recommended) | HW2 out; Slides; Notebook
Thursday, Sep 12 | Intro to Python: Dictionaries and Tuples | Either A. B. Downey, Chapters 11 and 12 or Severance, Chapters 9 and 10 (required); Python documentation on dictionaries (recommended); Python documentation on tuples (recommended); Python documentation on sets (recommended); A. B. Downey, Section B.4 (recommended); A. B. Downey, Chapter 13 (recommended) | Slides; Notebook
Tuesday, Sep 17 | File I/O and Objects | A. B. Downey, Chapter 14 or Severance, Chapter 7 (required); Python File I/O Documentation (required); Handling Errors and Exceptions (required); Python pickle module (recommended); A. B. Downey, Chapters 15 and 16 (required); Python documentation on classes (only through section 9.3) (required); D. Phillips (2015). Python 3 Object-oriented Programming, Second Edition. Packt Publishing. (recommended); M. Weisfeld (2009). The Object-Oriented Thought Process, Third Edition. Addison-Wesley. (recommended) | HW3 out; Slides; Notebook
Thursday, Sep 19 | File I/O and Objects (cont'd) | |
Tuesday, Sep 24 | File I/O and Objects (cont'd) | |
Thursday, Sep 26 | No lecture due to travel. | | No instructor office hours on Wednesday.
Tuesday, Oct 1 | Functional programming: itertools and functools | Python itertools documentation (required); Python functools documentation (required); A. M. Kuchling. Functional Programming HOWTO (required); M. R. Cook. A Practical Introduction to Functional Programming (recommended); D. Mertz Functional Programming in Python (recommended) | HW4 out; Slides; Notebook
Thursday, Oct 3 | Functional programming (cont'd) | |
Tuesday, Oct 8 | numpy, SciPy and matplotlib | Numpy quickstart tutorial (required); Pyplot tutorial (required); SciPy tutorial (recommended); Pyplot API (recommended); E. Tufte (2001). The Visual Display of Quantitative Information. Graphics Press. (recommended); E. Tufte (1997). Visual and Statistical Thinking: Displays of Evidence for Making Decisions. Graphics Press. (recommended) | Slides; Notebook
Thursday, Oct 10 | matplotlib (cont'd); Python pandas | pandas quickstart guide (required); Basic data structures (required); Basic functionality of pandas Series and DataFrames (required); pandas cookbook (recommended) | HW5 out; Slides; Notebook
Tuesday, Oct 15 | Fall study break. No lecture. | | HW6 out; Instructor office hours as usual.
Thursday, Oct 17 | Python pandas (cont'd) | pandas group-by operations (required); Reshaping and pivoting (required); Merge, join and concatenation (recommended); Time series functionality (recommended) | Slides; Notebook
Tuesday, Oct 22 | Regular expressions | Severance Chapter 11: Regular expressions (required); Python regex documentation (recommended) | HW7 out; Slides; Notebook
Thursday, Oct 24 | Markup languages; HTML, XML, JSON | Severance Chapter 12 (HTTP, HTML) and Chapter 13 (XML, JSON) (required); BeautifulSoup documentation (just Quick Start) (required); BeautifulSoup documentation (everything up to sections about CSS) (recommended); BeautifulSoup4 tutorial (recommended) | Slides; Notebook
Tuesday, Oct 29 | Interacting with Databases: SQL | Oracle relational databases overview (and only the overview!) (required); First section of Python sqlite3 documentation (required); w3schools SQL tutorial (recommended) | Slides; Notebook
Thursday, Oct 31 | UNIX/Linux command line | Introduction to UNIX Commands (required); Survival guide for UNIX newbies (recommended); GNU/Linux Command-Line Tools Summary (recommended); M. Shelley (1818). Frankenstein; or, The Modern Prometheus (recommended) | Slides
Tuesday, Nov 5 | Introduction to Hadoop and MapReduce | J. Dean and S. Ghemawat MapReduce: Simplified Data Processing on Large Clusters in Proceedings of the Sixth Symposium on Operating System Design and Implementation, 2004 (required); Introduction to HDFS by J. Hanson (recommended) | HW8 out; Slides
Thursday, Nov 7 | MapReduce in Python: mrjob | mrjob Fundamentals and Concepts (required); Hadoop wiki: How MapReduce operations are actually carried out (required) | Slides; mrjob demo code
Tuesday, Nov 12 | mrjob (cont'd); MapReduce in Python: PySpark | Spark programming guide (required); PySpark programming guide (required); Spark MLlib, a Spark machine learning library (recommended); Spark GraphX, a Spark library for processing graph data (recommended) | Slides; demo code
Thursday, Nov 14 | MapReduce in Python: PySpark (cont'd) | |
Tuesday, Nov 19 | Algorithms, Profiling and Debugging | A. B. Downey, Appendix B (required); Python cProfile/Profile documentation (recommended); Python unittest documentation (recommended) | HW9 out; Slides; Notebook; Demo files
Thursday, Nov 21 | Command line: part 2 | Data Science at the Command Line by J. Janssens (recommended); S. Das (2005, 2012). Your UNIX: the Ultimate Guide. McGraw-Hill. (recommended); Sed manual (recommended); GNU awk user’s guide (recommended) | Slides
Tuesday, Nov 26 | scikit-learn | sklearn quickstart tutorial (required); sklearn user-guide (recommended) | Slides; Notebook
Thursday, Nov 28 | No lecture: Thanksgiving break | Listen to Arlo Guthrie's Alice's Restaurant (recommended); Tune in to your local NPR station on Friday, November 29 to listen to the Ig Nobel Prize ceremony (recommended) | No instructor office hours Wednesday, November 27
Tuesday, Dec 3 | Google TensorFlow | Introduction to Low-Level TensorFlow API (required); Abadi, et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems (required); Assorted tutorials on statistical and neural models in TensorFlow (recommended) | HW10 out; Slides; Notebook
Thursday, Dec 5 | TensorFlow (cont'd) | | Slides; Demo: Digit Recognition with Softmax Classifier; Demo: Digit Recognition with Convolutional Neural Net
Tuesday, Dec 10 | TensorFlow (cont'd) | |