STAT606: Computing for Data Science and Statistics, Spring 2023

This course surveys tools and frameworks that are currently popular among data scientists and statisticians in both academia and industry. Our focus will be on complementing the tools that students already know from their previous courses on R. The course begins with an accelerated introduction to the Python programming language and brief introductions to object-oriented and functional programming. We then cover some of the scientific computing platforms available in Python, including numpy, scipy and scikit-learn, as well as visualization using matplotlib. Next, we turn to collecting data from the web, both by scraping and by using APIs. The course concludes with a brief survey of distributed computing, focusing on Hadoop and Google Cloud Platform.

Instructor: Keith Levin, kdlevin | at | wisc | dot | edu
Lectures: MW 9:30AM-10:45AM in Educational Sciences 212
Office Hours: Mondays and Wednesdays 11am-noon in MSC 6170, or by appointment
Textbook: There is no required textbook. See below for weekly readings.
Syllabus: Available here.
Prerequisites: There are no formal prerequisites for this course. Previous experience with programming in R, the UNIX/Linux command line, text editing in vim/emacs, regular expressions and distributed computing (equivalent to STAT605) is assumed.

Date | Topics | Readings | Notes and Resources
Week 0: Jan 24-27
  Topics:
  • Course introduction and administrivia
  • Installing and running Python and Jupyter
  Readings:
  • Jupyter notebook documentation (required)
  Notes and Resources:
  • HW00
  • Administrivia slides
Week 1: Jan 30 - Feb 3
  Topics:
  • Basic Python: types, variables and functions
  • Basic Python: conditionals and iteration
  Readings:
  • Either A. B. Downey, Chapters 1 through 3 or Severance, Chapters 1, 2 and 4 (required)
  • Either A. B. Downey, Chapter 5 or Severance, Chapters 3 and 5
  Notes and Resources:
  • Playlist: types, variables and functions; Slides; Demo code
  • Lec01 in-class exercises
  • Playlist: conditionals and iteration; Slides; Demo code
  • Lec02 in-class exercises
Week 2: Feb 6-10
  • Sequence data: strings, lists and tuples
  • List comprehensions
  • Python dictionaries and hashing
Week 3: Feb 13-15
  • Files and I/O
  • Python on the Command Line
Week 4: Feb 20-24
  • Basics of object-oriented programming
  • Classes and instances
  • Methods and attributes
Week 5: Feb 27-Mar 3
  • Basic concepts in functional programming
  • Map, reduce and filter
Week 6: Mar 6-10
  • numpy, scipy and matplotlib
Week 7: Mar 13-17
  • Spring Break. No lecture.
Week 8: Mar 20-24
  • Python pandas
Week 9: Mar 27-Mar 31
  • Markup languages: HTML, XML and JSON
Week 10: Apr 3-5
  • Databases and SQL
  • Retrieving data with APIs
Week 11: Apr 10-14
  • Introduction to Hadoop and MapReduce
  • MapReduce using mrjob
Week 12: Apr 17-21
  • MapReduce using PySpark
Week 13: Apr 24-28
  • Google TensorFlow and Keras
Week 14: May 1-5
  • Google TensorFlow and Keras, cont'd