STAT 304 Homework 3

We’ll grade your homework by opening your “HW3.Rmd” file in RStudio (in a directory containing “farm.csv”), clicking “Knit”, reading the HTML output, and reading your “HW3.Rmd” file. You should write R code anywhere you see an empty R code chunk.

Name: …

Email: …@wisc.edu

Part 1: A “jackknife” procedure to find the most outlying point in a linear relationship between two variables

First load the “XML” package to give access to readHTMLTable() and the “curl” package for access to curl().

if (!require("XML")) {
  install.packages("XML") # do this once per lifetime
  stopifnot(require("XML")) # do this once per session
}

## Loading required package: XML

if (!require("curl")) {
  install.packages("curl") # do this once per lifetime
  stopifnot(require("curl")) # do this once per session
}

## Loading required package: curl

Use R to get the land area (sq. miles) of each of the 50 states from the web page https://simple.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_area. Hint: you can use readHTMLTable(readLines(curl("https://simple.wikipedia.org/wiki/List_of_U.S._states_by_area")), stringsAsFactors=FALSE) to read the data. Include code to select only the 50 states and to remove the commas from the numbers.
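Hint: the comma-removal step can be sketched as below. This is only an illustration; the vector raw holds made-up strings standing in for a scraped column, not values from the web page.

```r
# Hypothetical strings, formatted the way a scraped table column might be.
raw <- c("1,234", "56,789.5", "42")

# gsub() deletes every comma; as.numeric() then converts the cleaned strings.
clean <- as.numeric(gsub(",", "", raw))
clean
```

Without the gsub() step, as.numeric() would return NA (with a warning) for any string containing a comma.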

Use R to get farm areas of states from “farm.csv”.

Create a data frame called “area” whose columns are “state”, “farm”, and “land”, which contain state names, farm areas, and land areas, respectively. Hint: the states aren’t in the same order in the two data sets, so getting the “area” data frame right requires a little care.
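One way to line up two data sets whose rows are in different orders is match(). The sketch below uses toy data frames (farm.toy and land.toy are hypothetical names with made-up numbers), so treat it as an illustration of the idea rather than the required approach.

```r
# Toy data with made-up numbers; the rows are deliberately in different orders.
farm.toy <- data.frame(state = c("B", "C", "A"), farm = c(20, 30, 10),
                       stringsAsFactors = FALSE)
land.toy <- data.frame(state = c("A", "B", "C"), land = c(100, 200, 300),
                       stringsAsFactors = FALSE)

# match() gives, for each state in farm.toy, its row position in land.toy.
rows <- match(farm.toy$state, land.toy$state)
stopifnot(!anyNA(rows)) # every state must appear in both data sets

area.toy <- data.frame(state = farm.toy$state,
                       farm = farm.toy$farm,
                       land = land.toy$land[rows],
                       stringsAsFactors = FALSE)
area.toy
```

merge(farm.toy, land.toy, by = "state") is an alternative that does the matching for you. Either way, check that the state names are spelled identically in both sources.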

Make a scatterplot of y = farm area vs. x = land area.

There are two prominent outliers. Use identify() to find their indices. Unfortunately, identify() doesn’t work on an R graph that we’re viewing through an HTML page. To find the outliers, run your code in the Console so you can click on the graph in RStudio’s “Plots” tab. Once you know the indices of the outliers, just assign them to variables so you can use them later.

The two outliers are Texas, which fits the roughly linear trend of the rest of the data, and Alaska, which does not fit.

Make a linear model of y = farm area vs. x = land area. Make your scatterplot again, and this time add the regression line to it. Then make a linear model of the same data, except with Alaska removed. Add that regression line, colored red, to your scatterplot.
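The two-line plot described above can be sketched on made-up data as follows. The data frame d and the index odd are hypothetical; in your homework the data come from “area” and the index comes from identify().

```r
# Made-up data: a roughly linear trend plus one point (row 10) far below it.
d <- data.frame(x = 1:10,
                y = c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8, 18.1, 0))
odd <- 10 # row of the unusual point, chosen by hand for this toy example

m.all <- lm(y ~ x, data = d)           # fit to every point
m.sub <- lm(y ~ x, data = d[-odd, ])   # fit with the unusual point removed

plot(d$x, d$y, xlab = "x", ylab = "y")
abline(m.all)              # black: regression line from all points
abline(m.sub, col = "red") # red: regression line with the point removed
```

In this toy example the black line is pulled down toward the unusual point, so its slope is smaller than the red line's.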

Notice that, with respect to the original regression line, Texas has the biggest residual (difference between actual and predicted y), because Alaska pulled the line down toward itself. But really Alaska is the outlier! Next we’ll do a “jackknife” procedure to discover computationally that Alaska is the most important outlier.

Make a plot of the residuals for the original model. (Hint: they’re available in the output of lm().)

Notice again that the Texas residual is bigger than the Alaska residual.

Next use a loop to create n=50 models. In step i, make a model of the data with observation i removed. Then predict the value of y[i] from that model, and find the residual (difference) between (the removed) y[i] and the prediction. Save these residuals in a vector r.jack. (A “jackknife” procedure works by removing one observation (or several) from a data set, and then making a prediction from that smaller data set, and repeating this for each observation.)
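A minimal sketch of such a leave-one-out loop, on made-up data rather than the state areas, might look like this. The data frame d is hypothetical, and your own code may be organized differently.

```r
# Made-up data (not the homework data): a roughly linear trend plus one
# point, row 10, that falls far below it.
d <- data.frame(x = 1:10,
                y = c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8, 18.1, 0))
n <- nrow(d)

r.jack <- numeric(n) # one leave-one-out residual per observation
for (i in seq_len(n)) {
  m <- lm(y ~ x, data = d[-i, ])                      # model without row i
  y.hat <- predict(m, newdata = d[i, , drop = FALSE]) # predict the removed y
  r.jack[i] <- d$y[i] - y.hat                         # actual minus predicted
}

which.max(abs(r.jack)) # row with the largest leave-one-out residual
```

Because the model in step i never sees observation i, an unusual point cannot pull the line toward itself before its residual is measured.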

Plot these “jackknife” residuals.

Notice now that Alaska is clearly the real outlier.

Part 2: Web-scraping

Here we figure out which people produced the most movies in the IMDB Top Rated Movies list. (An example related to this search is the NFL web scraping code discussed in lecture.)

rm(list=ls())

First load the “XML” package to give access to readHTMLTable() and the “curl” package for access to curl().

if (!require("XML")) {
  install.packages("XML") # do this once per lifetime
  stopifnot(require("XML")) # do this once per session
}
if (!require("curl")) {
  install.packages("curl") # do this once per lifetime
  stopifnot(require("curl")) # do this once per session
}

At the bottom of the Internet Movie Database website there’s a link to the Top Rated Movies. At this page there’s a list of 250 movies, with a link to each movie. The first movie is The Shawshank Redemption.

With your browser on the “Top Rated Movies” page, you can do “right-click > view page source” (in Firefox or Chrome; in Safari, first do “Safari > Preferences > Advanced” and check “Show Develop menu in menu bar”; then do Develop > Show Page Source) to see the HTML code that creates the page. (You do not need to learn HTML for this homework.)

Search in the HTML source page for “Shawshank”, and you’ll see that it occurs twice, once in an <img .../> tag and once in an <a...</a> tag. Search for “Godfather”, and you’ll see that it occurs four times, twice for “The Godfather” and twice for “The Godfather: Part II”. For each of these three <a...</a> lines, the preceding line contains a link, relative to the main IMDB URL, to that movie’s page. Use grep() to figure out what small string is common to the 250 lines, like these three, that contain links to the top 250 movies.

Notice that the second line for “The Shawshank Redemption” includes the text “/title/tt0111161”. Pasting this onto “http://www.imdb.com” gives “http://www.imdb.com/title/tt0111161”, which is a link to the first movie’s page. Adding “/fullcredits” gives “http://www.imdb.com/title/tt0111161/fullcredits”, which is a link to the full cast and crew. Search this “fullcredits” page for “Produced” and you’ll see that “The Shawshank Redemption” was produced by “Liz Glotzer”, “David V. Lester”, and “Niki Marvin”.
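The pasting described above can be sketched as follows. The string line is a hypothetical piece of HTML shaped like the one described; the real page source may be laid out differently, so the pattern is an assumption you should check against what you see.

```r
# A hypothetical line of HTML; the real page source may look different.
line <- '<a href="/title/tt0111161/">The Shawshank Redemption</a>'

# regexpr() finds the first "/title/tt" followed by digits;
# regmatches() extracts the matched text.
path <- regmatches(line, regexpr("/title/tt[0-9]+", line))

movie.url <- paste0("http://www.imdb.com", path)
credits.url <- paste0(movie.url, "/fullcredits")
credits.url
```

Both regmatches() and paste0() are vectorized, so the same calls work on a vector of many lines at once, provided every line contains a match.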

Write code that does the following:

- Use readLines() to read “http://www.imdb.com/chart/top” into a character string vector.
  - Select the 250 lines containing links to the 250 movies
  - From these 250 lines, select the 250 strings like “/title/tt0111161” from which you can form links to the 250 movies (well, there seem to be 251 lines, and then 251 strings that contain one duplicate; see ?unique to remove the duplicate)
- Create an empty list of producers, e.g. producers = list()
- For each movie, read its “fullcredits” page
  - Strip out the title of the movie
  - Use readHTMLTable(readLines(curl())) to read all the tables into a list of dataframes; figure out which dataframe has the producers; you will need to replace “http” with “https” in each movie’s fullcredits URL (like “http://www.imdb.com/title/tt0111161/fullcredits”) to get readHTMLTable(readLines(curl())) to work
  - Save the vector of producers in a list, doing something like producers[[title]] = ..., where ... is the vector of producers you found
- Do unlist(producers) to convert your list of title / producer vector pairs into a named vector of producers.
  - Use table() to make a table of counts from this vector
  - Display the 5 producers who produced the most movies from among these 250
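The bookkeeping steps in this list (removing a duplicate, switching “http” to “https”, and counting producers) can be sketched without touching the web. Everything below is made up for illustration: the second path, the movie titles, and the producer names are hypothetical.

```r
# Made-up paths with one duplicate; unique() keeps the first copy of each.
paths <- unique(c("/title/tt0111161", "/title/tt0000000", "/title/tt0111161"))
urls <- paste0("http://www.imdb.com", paths, "/fullcredits")
urls <- sub("^http:", "https:", urls) # replace only the leading scheme

# Made-up titles and producer names, standing in for scraped results.
producers <- list()
producers[["Movie One"]] <- c("Producer A", "Producer B")
producers[["Movie Two"]] <- c("Producer A", "Producer C")
producers[["Movie Three"]] <- c("Producer A", "Producer B")

all.producers <- unlist(producers) # named character vector
counts <- table(all.producers)     # number of movies per producer
top <- head(sort(counts, decreasing = TRUE), 2) # use 5 for the homework
top
```

Anchoring the pattern with “^” keeps sub() from altering any later “http” that might appear inside a URL.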

Extra Practice (not required)

Collect Year, Director, Rating, Number of Votes and Cast (first billed only)
For each actor, count how many times he or she starred in a Top 250 Movie. Show the 10 actors/actresses that starred in the most movies among the Top 250. Show the 10 actors/actresses that starred in movies among the Top 250 with the highest mean rating.
For each director, count how many times he or she directed a Top 250 Movie. Show the 10 directors that directed the most movies among the Top 250. Show the 10 directors that directed movies among the Top 250 with the highest mean rating.
Show the 10 most frequent Actor-Director collaborations among the Top 250 Movies. What’s the average rating for those collaborations?
Are ratings influenced by year? In what way? Provide a P-value using linear regression. Are the assumptions of linear regression violated? If so, what’s the impact on your P-value estimate?
Do people vote more often for recent movies? Provide a P-value using linear regression. Are the assumptions of linear regression violated? If so, what’s the impact on your P-value estimate?
In light of the previous question, do you think the number of votes influences the rating? Create an analysis of variance table for the ratings, considering year, votes and the interaction of year and votes. Explain what the interaction means.