email: first name and last name at stat dott wisc dott edu
office: 6110 MSC
Syllabus, R labs in ISLR, Project Timeline, Project description
Week 6
- Make groups for the final project.
- Practice midterm. Complete this practice, compile into html, and turn in as a homework assignment before Monday Oct 18 at 11pm.
- See the Project Timeline
Week 5
In-class review exercise:
Make a plot with three panels, one each for EWR, LGA, and JFK. The horizontal axis should give the date. The vertical axis should give the proportion of flights canceled. Add another line that corresponds to a weather variable.
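One possible starting point for this exercise, as a rough sketch and not the only way to do it. It assumes the nycflights13 data used in r4ds, treats a missing dep_time as a canceled flight, and picks visibility as the weather variable (rescaled so that it fits on the same 0 to 1 axis). All of those are my assumptions; swap in your own choices.

```r
library(dplyr)
library(ggplot2)
library(lubridate)
library(nycflights13)

# Proportion of flights canceled per airport per day.
# Assumption: a missing dep_time marks a canceled flight.
canceled <- flights %>%
  mutate(date = make_date(year, month, day)) %>%
  group_by(origin, date) %>%
  summarise(prop_canceled = mean(is.na(dep_time)), .groups = "drop")

# Daily mean visibility per airport, flipped and rescaled to [0, 1]
# so that "bad weather" is high, like the canceled proportion.
daily_weather <- weather %>%
  mutate(date = make_date(year, month, day)) %>%
  group_by(origin, date) %>%
  summarise(visib = mean(visib, na.rm = TRUE), .groups = "drop") %>%
  mutate(low_visib = 1 - visib / max(visib, na.rm = TRUE))

plot_data <- left_join(canceled, daily_weather, by = c("origin", "date"))

# One panel per airport, date on the horizontal axis, two lines per panel.
p <- ggplot(plot_data, aes(x = date)) +
  geom_line(aes(y = prop_canceled, color = "proportion canceled")) +
  geom_line(aes(y = low_visib, color = "low visibility (rescaled)")) +
  facet_wrap(~ origin, ncol = 1) +
  labs(x = "date", y = "proportion", color = NULL)
p
```

Putting both lines on one axis only works because the weather variable was rescaled; a different variable (precip, wind_speed) would need its own rescaling.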
Topics
- Midterm Tu Oct 19.
- Has anyone formed a group for a project?
- recap: pivot_wider and pivot_longer
- recap: Plotting with tidy data.
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(aes(<MAPPINGS>), position = <POSITION>) +
<COORDINATE_FUNCTION> +
<FACET_FUNCTION>
- recap: Handling relational data. Chapter 13 in r4ds.
- Shiny is great. Example,
library(shiny)
runGitHub("shiny_example", "rstudio")
Your homework for next week is to build this…
library(shiny)
runUrl( "http://pages.stat.wisc.edu/~karlrohe/ds479/code/census-app.zip")
- But first, an introduction to Shiny (which is pretty), and my summary (which is not pretty).
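To make the Shiny pieces concrete, here is a minimal sketch of an app with one input and one output. It is not the census app from the homework; the slider, the histogram, and the names n and hist are made up for illustration.

```r
library(shiny)

# The user interface: one slider (input) and one plot (output).
ui <- fluidPage(
  titlePanel("Minimal sketch"),
  sliderInput("n", "Number of points", min = 10, max = 500, value = 100),
  plotOutput("hist")
)

# The server: re-draws the histogram whenever input$n changes.
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(rnorm(input$n), main = paste(input$n, "normal draws"))
  })
}

# Build the app object. In an interactive session, runApp(app) launches it.
app <- shinyApp(ui = ui, server = server)
```

The ui and server objects are the two halves you will see in every Shiny app; the ids ("n", "hist") are how the two halves talk to each other.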
Homework
Go to github.com. Make an empty repository called CensusApp. Clone the repo in Rstudio. Then, go through Lesson 5 to construct the app in that repository. Commit and push your repository. Then, turn in the command
runGitHub("CensusApp", "<your github user name>")
that runs your code! See here for more details (if that is hard).
Week 4
Topics
- Midterm Tu Oct 19.
- When thinking about your project… start from a place of genuine interest. Then, look for data in that area.
- Review exercise: Make a table where each row is a day of the year. The first column is the date. Columns 2:4 give the number of (scheduled) departures from EWR, LGA, and JFK.
- Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
- Dates are a common statistical medium (a way of recording a thing). Whenever you come across another statistical medium, try to find a good package to deal with it! See lubridate in this code.
- organizing, or “tidying” your data.
- Plotting with tidy data.
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(aes(<MAPPINGS>), position = <POSITION>) +
<COORDINATE_FUNCTION> +
<FACET_FUNCTION>
- Handling relational data. Chapter 13 in r4ds.
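The review exercise in the list above (one row per day, one column per airport) can be sketched with lubridate and tidyr as follows. This assumes the nycflights13 flights table; the object names daily and daily_long are mine.

```r
library(dplyr)
library(tidyr)
library(lubridate)
library(nycflights13)

# One row per day; columns 2:4 count scheduled departures per airport.
daily <- flights %>%
  mutate(date = make_date(year, month, day)) %>%
  count(date, origin) %>%
  pivot_wider(names_from = origin, values_from = n) %>%
  select(date, EWR, LGA, JFK)

# The wide table is nice to read; the long (tidy) form is nicer to plot.
daily_long <- daily %>%
  pivot_longer(cols = c(EWR, LGA, JFK), names_to = "origin", values_to = "n")
```

Note that every row of flights is a scheduled departure, canceled or not, so count() needs no filtering here.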
Readings
Chapters 12 and 3 in r4ds. Then, chapter 13.
Homework
Due October 5th, by midnight: In r4ds flights… What time of day should you fly if you want to avoid delays as much as possible? Does this choice depend on anything? Season? Weather? Airport? Airline? Find three patterns (“null results” are ok!). Write your results into Rmarkdown. Include a short introduction that summarizes the three results. Then, have a section for each finding. Support each finding with data summaries and visualizations. Include your code when necessary. This shouldn’t be long, but it might take some time to find the things you want to talk about and lay them out in an orderly way.
Week 3
Topics
- Midterm: Th Oct 14 or Tu Oct 19?
- What types of data are you interested in studying (e.g. in class project)?
- Handling/wrangling data with dplyr. In these exercises, we manipulate a 40 MB file (330k rows and 19 columns). For each of the operations, you should already know a “substitute” that you could code yourself. As such, there is a fixed cost to learning something new. Moreover, the benefits will not be apparent until we study %>%. It is worth it, I promise.
- When you are using agile syntax, what is your stance/mode-of-being? (this is a “trick question” that is fundamental to this course.)
- Note: if you have a SQL database, you can use dbplyr. It allows you to use remote database tables as if they are in-memory data frames by automatically converting dplyr code into SQL. There are a lot of other extensions for other database-ish things.
- Which plane (tailnum) has the worst on-time record?
- What time of day should you fly if you want to avoid delays as much as possible?
- Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?
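For the last question in the list above, a grouped mutate is one way in. This is a sketch under my own choices: it drops flights with missing air_time, and it uses the ratio to the median as a rough flag for "suspiciously fast". The column names shortest, rel_to_shortest, and rel_to_median are made up here.

```r
library(dplyr)
library(nycflights13)

rel <- flights %>%
  filter(!is.na(air_time)) %>%
  group_by(dest) %>%
  mutate(
    shortest = min(air_time),                    # fastest flight to this dest
    rel_to_shortest = air_time / shortest,       # 1 = as fast as the fastest
    rel_to_median = air_time / median(air_time)  # well below 1 = suspicious
  ) %>%
  ungroup()

# Candidates for data entry errors: much faster than typical for the route.
rel %>%
  arrange(rel_to_median) %>%
  select(dest, tailnum, air_time, rel_to_median) %>%
  head(10)

# Flights most delayed in the air, relative to the shortest flight.
rel %>%
  arrange(desc(rel_to_shortest)) %>%
  select(dest, tailnum, air_time, shortest, rel_to_shortest) %>%
  head(10)
```

Whether a ratio or a difference is the right measure of "delayed in the air" is a fork in the road; try both and see what changes.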
Reading
Chapter 5 in r4ds. Please include a link to your github repo in the html file that you turn in.
Homework / in class exercise
- How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
- Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
- Look at the number of canceled flights per day. Is there a pattern? Is the proportion of canceled flights related to the average delay? Use multiple dplyr operations, all on one line, concluding with
ggplot(aes(x= ,y=)) + geom_point()
These questions come from r4ds. Please turn them in Tuesday (9/28) at 5pm in canvas.
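For the dep_time conversion, the trick is that a time like 517 means 5:17, so the hours are the part above 100 and the minutes are the remainder. A small base R sketch (the function name to_minutes is mine, and I am assuming midnight shows up as 2400, which should then map to 0):

```r
# Convert an HHMM-style number (e.g. 517 for 5:17) to minutes since midnight.
to_minutes <- function(hhmm) {
  hours <- hhmm %/% 100         # integer division drops the minutes
  minutes <- hhmm %% 100        # remainder is the minutes
  (hours * 60 + minutes) %% 1440  # wrap 2400 (midnight) around to 0
}

to_minutes(c(517, 1200, 2359, 2400))
```

With dplyr this would sit inside a mutate, for example mutate(dep_minutes = to_minutes(dep_time)). Missing values stay missing.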
Week 2
Topics
- Syllabus
- Get github setup
- For the first part of the course, we will use r4ds as our text. If you have used it before, that’s because it is a super great text.
- Reading data is hard, particularly when it comes from “outside”. This is particularly frustrating because it is not the part we think about. When I hear about a data set, I immediately start thinking about how I want to analyze it. Then, sometimes it takes 3 days (or 3 months) to simply load the data. So frustrating!
- Let’s try to read some bad data. Feel free to talk with your neighbor.
- Reflections on reading bad data
Readings
Chapter 11 in r4ds.
Note… You are required to do these readings. If you do not, then class will not be fun.
Homework
Get github setup
Load either of the following data into a data frame (e.g. tibble) or csv structure. Post your code and data in a github repo. For a small project like this, I like to put the code in the github README with Rmarkdown. Put the README html file in the dropbox on canvas.
- Download some bridges data.
- For each person listed here, we want a row in a table. That table should have four columns: name, position, department, degree information. I wrote a short script to do this. If you’ve never seen html before, my code will look weird, but it isn’t that bad (I personally don’t know html! eek). The basic idea is that the things we want are divided up by things like <br/></p></li><li><p> and <ul class=\"uw-people\"><li><p> and <br/>. I know just enough html to see that those chunks split up the things we want. So, my code gets the raw html, then uses strsplit on those little chunks of html code. strsplit makes a list, so I also use unlist. However, my code is long and not so agile. The great thing about R is that often there are packages that do stuff like that for you. So, bonus points for anyone that can make really easy code to get that data.
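To show the strsplit and unlist idea on something small, here is a sketch on a toy string. The html below is invented for illustration (two fake people); the real page will be messier, and this is not the script mentioned above.

```r
# Toy html, invented for illustration; the real page will differ.
raw_html <- paste0(
  "<ul class=\"uw-people\"><li><p>",
  "Person One<br/>Professor<br/>Statistics<br/>PhD",
  "<br/></p></li><li><p>",
  "Person Two<br/>Lecturer<br/>Statistics<br/>MS",
  "<br/></p></li></ul>"
)

# fixed = TRUE treats the chunk as plain text, not a regular expression.
# Keep what follows the opening chunk (the piece before it is empty).
body <- unlist(strsplit(raw_html, "<ul class=\"uw-people\"><li><p>", fixed = TRUE))[2]

# One string per person.
people <- unlist(strsplit(body, "<br/></p></li><li><p>", fixed = TRUE))

# Split each person on <br/> and keep the first four fields.
fields <- lapply(strsplit(people, "<br/>", fixed = TRUE), function(x) x[1:4])

# Stack the fields into a table with one row per person.
people_df <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
names(people_df) <- c("name", "position", "department", "degree")
people_df
```

This breaks as soon as someone is missing a field, which is exactly why a package that understands html is worth looking for.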
Week 1
Topics
- Themes of Data Science
- Think of this course as a key resource that, if used effectively, will help you create your own digital portfolio. Put it on github and show it off!
- “Five stances” and “DS is a performance” could have been said about “applied statistics” 50 years ago. Is “DS” just marketing? (1) who cares. (2) the “rebranding” has aligned (in time) with a massive change in the way that we perform applied statistics. To me, Data Science is data analysis in the age of the internet. This has two major implications.
- Course pedagogy: I will rarely lecture (like I just did). Instead, we will have labs that we will perform individually, in small groups; everyone is coding their own code, but we are discussing in small groups. These labs will often have landmines (things which you don’t yet know how to do) and forks in the road (decisions that you must make). I want you to navigate these things yourselves, in small groups. The aim is not to “teach you” how to do these isolated tasks. Rather, it is to help you figure out how to do things by yourselves. This is convenient because I am sometimes just as puzzled as you are! :) We will figure it out together. If this sounds tedious or if you would prefer neat clean lectures, I would advise you to stick around and try this course! However, you are also welcome to find another class. If you are stuck here for graduation requirements and this all sounds like a nightmare, please let me know. Truly.
- Syllabus
- Let’s choose our own adventure… where should we go from here? What do you think you should learn in this course? What do you want to learn?