STAT606: Computing for Data Science and Statistics, Spring 2023

This course surveys some of the tools and frameworks that are currently popular among data scientists and statisticians working in both academia and industry. Our focus will be on complementing the tools that students already know from their previous courses on R. The course will begin with an accelerated introduction to the Python programming language and brief introductions to object-oriented and functional programming. We will then cover some of the scientific computing platforms available in Python, including numpy, scipy and scikit-learn, as well as visualization using matplotlib. Next, we will turn to collecting data from the web, both by scraping and by using APIs. The course will conclude with a brief survey of distributed computing, focusing on Hadoop and Google Cloud Platform.

Instructor: Keith Levin, kdlevin | at | wisc | dot | edu
Lectures: MW 9:30AM-10:45AM in Educational Sciences 212
Office Hours: MW 11:00AM-12:00PM in MSC 6170, or by appointment
Textbook: There is no required textbook. See below for weekly readings.
Syllabus: Available here.
Prerequisites: There are no formal prerequisites for this course. However, previous experience with programming in R, the UNIX/Linux command line, text editing in vim/emacs, regular expressions and distributed computing (equivalent to STAT605) is assumed.

Date Topics Readings Notes and Resources
Week 0: Jan 24-27
  • Course introduction and Administrivia
  • Installing and running Python and Jupyter
  • Jupyter notebook documentation (required)
  • HW00
  • Administrivia slides
Week 1: Jan 30 - Feb 3
  • Basic Python: types, variables and functions
  • Basic Python: conditionals and iteration
  • Either A. B. Downey, Chapters 1 through 3 or Severance, Chapters 1, 2 and 4 (required)
  • Either A. B. Downey, Chapter 5 or Severance, Chapters 3 and 5
  • HW01
  • Playlist: types, variables and functions; Slides; Demo code
  • Lec01 in-class exercises
  • Playlist: conditionals and iteration; Slides; Demo code
  • Lec02 in-class exercises
Week 2: Feb 6-10
  • Sequence data: strings, lists and tuples
  • List comprehensions
  • Python dictionaries and hashing
  • Either A. B. Downey, Chapters 8 and 10 or Severance, Chapters 6 and 8 (required); A. B. Downey, Chapter 9 (recommended)
  • Python documentation on lists (recommended); Python documentation on sequences (recommended)
  • Either A. B. Downey, Chapters 11 and 12 or Severance, Chapters 9 and 10 (required)
  • Python documentation on dictionaries (recommended)
  • Python documentation on tuples (recommended)
  • Python documentation on sets (recommended)
  • A. B. Downey, Section B.4 (recommended); A. B. Downey, Chapter 13 (recommended)
  • HW02
  • Playlist: strings, lists and sequence data; Slides; Demo code
  • Lec03 in-class exercises
  • Playlist: dictionaries and tuples; Slides; Demo code
  • Lec04 in-class exercises
Week 3: Feb 13-15
  • Files and I/O
  • Python on the Command Line
  • A. B. Downey, Chapter 14 or Severance, Chapter 7 (required)
  • Python File I/O Documentation (required)
  • Handling Errors and Exceptions (required)
  • Python pickle module (recommended)
  • Overview of the Python interpreter (recommended)
  • Calling Python from the command line (recommended)
  • Python sys module (recommended)
  • HW03
  • Playlist: Files and I/O; Slides; Demo code
  • Lec05 in-class exercises
  • Playlist: Python on the Command Line; Slides; Demo code
  • Lec06 in-class exercises
Week 4: Feb 20-24
  • Basics of object-oriented programming
  • Classes and instances
  • Methods and attributes
  • Either A. B. Downey, Chapters 15 and 16 or Severance, Chapter 14 (required)
  • Python documentation on classes (only through section 9.3) (required)
  • D. Phillips (2015). Python 3 Object-oriented Programming, Second Edition. Packt Publishing. (recommended)
  • M. Weisfeld (2009). The Object-Oriented Thought Process, Third Edition. Addison-Wesley. (recommended)
  • HW04
  • Playlist: Objects and Classes; Slides; Demo code
  • Lec07 in-class exercises
  • Lec08 in-class exercises
Week 5: Feb 27 - Mar 3
  • Basic concepts in functional programming
  • Map, reduce and filter
  • Python itertools documentation (required)
  • Python functools documentation (required)
  • A. M. Kuchling. Functional Programming HOWTO (required)
  • M. R. Cook. A Practical Introduction to Functional Programming (recommended)
  • D. Mertz. Functional Programming in Python (recommended)
  • HW05
  • Playlist: Functional Programming; Slides; Demo code
  • Lec09 in-class exercises
  • Lec10 in-class exercises
Week 6: Mar 6-10
  • numpy, scipy and matplotlib
  • Numpy quickstart tutorial (required)
  • SciPy tutorial (recommended)
  • Pyplot tutorial (required)
  • Pyplot API (recommended)
  • E. Tufte (2001). The Visual Display of Quantitative Information. Graphics Press. (recommended)
  • E. Tufte (1997). Visual and Statistical Thinking: Displays of Evidence for Making Decisions. Graphics Press. (recommended)
  • HW06
  • Playlist: numpy and scipy; Slides; Demo code
  • Playlist: matplotlib; Slides; Demo code
  • Lec11 in-class exercises
  • Lec12 in-class exercises
Mar 13-17: Spring Break. No lecture.
Week 7: Mar 20-24
  • Python pandas
  • pandas quickstart guide (required)
  • Basic data structures (required)
  • Basic functionality of pandas Series and DataFrames (required)
  • pandas group-by operations (required)
  • Reshaping and pivoting (required)
  • pandas cookbook (recommended)
  • Merge, join and concatenation (recommended)
  • Time series functionality (recommended)
  • HW07
  • Playlist: pandas; Slides; Demo code; CSV file for demo code
  • Lec13 in-class exercises
  • Lec14 in-class exercises
Week 8: Mar 27-31
  • Markup languages: HTML, XML and JSON
  • Severance, Chapter 12 (HTTP, HTML) and Chapter 13 (XML, JSON) (required)
  • BeautifulSoup documentation (Quick Start up to "CSS selectors...") (required)
  • BeautifulSoup4 tutorial (recommended)
  • HW08
  • Playlist: markup languages; Slides; Demo code
  • Lec15/16 in-class exercises
Week 9: Apr 3-5
  • Databases and SQL
  • Retrieving data with APIs
  • Oracle relational databases overview (and only the overview!) (required)
  • First section of Python sqlite3 documentation (required)
  • w3schools SQL tutorial (recommended)
  • Wikipedia: Web APIs (recommended)
  • Overview of HTTP Request Methods (recommended)
  • HW09
  • Playlist: Databases and SQL; Slides; Demo code
  • Playlist: Web APIs; Slides; Demo code
  • Lec17 in-class exercises
  • Lec18 in-class exercises
Week 10: Apr 10-14
  • Introduction to Hadoop and MapReduce
  • MapReduce using mrjob
  • J. Dean and S. Ghemawat (2004). MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the Sixth Symposium on Operating System Design and Implementation. (required)
  • HDFS Architecture Guide (recommended)
  • mrjob Fundamentals and Concepts (required)
  • Hadoop wiki: How MapReduce operations are actually carried out (required)
  • HW10
  • Playlist: MapReduce; Slides
  • Playlist: mrjob; Slides; Demo code
  • Lec19/20 in-class exercises
Week 11: Apr 17-21
  • MapReduce using PySpark
  • PySpark quickstart (required)
  • RDD programming guide (required)
  • Spark MLlib, a Spark machine learning library (recommended)
  • Spark GraphX, a Spark library for processing graph data (recommended)
  • HW11
  • Playlist: PySpark; Slides; Demo code
  • Lec21 in-class exercises
  • Lec22 in-class exercises
Week 12: Apr 24-28
  • Google TensorFlow and Keras
  • Guide: TensorFlow Basics (required)
  • TensorFlow Estimators API (recommended)
  • HW12
  • Playlist: TensorFlow; Slides; Demo code
  • Lec23 in-class exercises
  • Lec24 in-class exercises
Week 13: May 1-5
  • Google TensorFlow and Keras, cont'd
  • Modules, layers and models in TensorFlow (required)
  • Training loops in TensorFlow (required)
  • Specifying models with Keras (recommended)
  • Playlist: Building and Training Models in TensorFlow; Slides; Demo code
  • Lec25/26 in-class exercises