We’ll grade your homework by opening your “HW3.Rmd” file in RStudio (in a directory containing “farm.csv”), clicking “Knit”, reading the HTML output, and reading your “HW3.Rmd” file. You should write R code anywhere you see an empty R code chunk.
Name: …
Email: …@wisc.edu
First load the “XML” package to give access to readHTMLTable() and the “curl” package for access to curl().
if (!require("XML")) {
  install.packages("XML") # do this once per lifetime
  stopifnot(require("XML")) # do this once per session
}
## Loading required package: XML
if (!require("curl")) {
  install.packages("curl") # do this once per lifetime
  stopifnot(require("curl")) # do this once per session
}
## Loading required package: curl
Use R to get the land area (sq. miles) of each of the 50 states from the web page https://simple.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_area. Hint: you can use readHTMLTable(readLines(curl("https://simple.wikipedia.org/wiki/List_of_U.S._states_by_area")), stringsAsFactors=FALSE) to read the data. Include code to select only the 50 states and to remove the commas from the numbers.
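As a rough illustration of the two cleanup steps, here is a minimal sketch on a toy data frame. The column names `state` and `land` and all of the numbers are made up for the example; the real scraped table will have its own layout, which you need to inspect yourself.

```r
# Toy stand-in for part of a scraped table (names and values are made up).
raw <- data.frame(
  state = c("Alaska", "Texas", "Guam", "Wisconsin"),
  land  = c("1,000", "2,500.5", "210", "54,158"),
  stringsAsFactors = FALSE
)

# Keep only rows whose name is one of the 50 states.
# state.name is a character vector that ships with base R.
states_only <- raw[raw$state %in% state.name, ]

# Strip the thousands separators, then convert the text to numbers.
states_only$land <- as.numeric(gsub(",", "", states_only$land, fixed = TRUE))

str(states_only)
```

Using `fixed = TRUE` treats the comma as a literal character rather than a regular expression, which is all this step needs.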
Use R to get farm areas of states from “farm.csv”.
Create a data frame called “area” whose columns are “state”, “farm”, and “land”, which contain state names, farm areas, and land areas, respectively. Hint: the states aren’t in the same order in the two data sets, so getting the “area” data frame right requires a little care.
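One way to line up two data sets whose rows are in different orders is match(). The sketch below uses two tiny made-up data frames; the names `land_df` and `farm_df` and their columns are hypothetical, and “farm.csv” may be organized differently.

```r
# Two toy tables listing the same states in different orders (values are made up).
land_df <- data.frame(state = c("Texas", "Alaska", "Iowa"),
                      land  = c(30, 10, 20),
                      stringsAsFactors = FALSE)
farm_df <- data.frame(state = c("Iowa", "Texas", "Alaska"),
                      farm  = c(2, 3, 1),
                      stringsAsFactors = FALSE)

# For each state in land_df, find its row position in farm_df.
idx <- match(land_df$state, farm_df$state)
stopifnot(!anyNA(idx))  # every state must be found in both tables

# Reorder the farm values so each row describes a single state.
area <- data.frame(state = land_df$state,
                   farm  = farm_df$farm[idx],
                   land  = land_df$land,
                   stringsAsFactors = FALSE)
print(area)
```

merge(land_df, farm_df, by = "state") would be another reasonable choice; it aligns rows by key and sorts the result by state.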
Make a scatterplot of y = farm area vs. x = land area.
There are two prominent outliers. Use identify() to find their indices. Unfortunately, identify() doesn’t work on an R graph that we’re viewing through an HTML page. To find the outliers, run your code in the Console so you can click on the graph in RStudio’s “Plots” tab. Once you know the indices of the outliers, just assign them to variables so you can use them later.
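The workflow described above might look like the following sketch. The toy data and the variable names `outlier.a` and `outlier.b` are hypothetical. The identify() call only runs in an interactive session, so knitting skips it and relies on the hard-coded indices.

```r
# Toy data with two points far from the rest (values are made up).
toy <- data.frame(x = c(1, 2, 3, 4, 30, 40),
                  y = c(1, 2, 3, 4, 29, 5))
plot(toy$x, toy$y, xlab = "x", ylab = "y")

# In the Console: click near two points, and identify() returns their indices.
if (interactive()) {
  clicked <- identify(toy$x, toy$y, n = 2)
  print(clicked)
}

# After noting the indices once, record them so the knitted document can use them.
outlier.a <- 5
outlier.b <- 6
```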
The two outliers are Texas, which fits the roughly linear trend of the rest of the data, and Alaska, which does not fit.
Make a linear model of y = farm area vs. x = land area. Make your scatterplot again, and this time add the regression line to it. Then make a linear model of the same data, except with Alaska removed. Add that regression line, colored red, to your scatterplot.
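Here is a minimal sketch of fitting and drawing two regression lines, one with and one without a chosen point. It uses made-up toy data, not the state data; the index stored in `drop` is an assumption for the example.

```r
# Toy data: five points on a rising trend plus one point far to the right.
toy <- data.frame(x = c(1, 2, 3, 4, 5, 20),
                  y = c(2.1, 3.9, 6.2, 8.1, 9.8, 5))
drop <- 6  # index of the point to leave out of the second fit

m.all  <- lm(y ~ x, data = toy)            # all observations
m.trim <- lm(y ~ x, data = toy[-drop, ])   # one observation removed

plot(toy$x, toy$y, xlab = "x", ylab = "y")
abline(m.all)                 # line from the full data
abline(m.trim, col = "red")   # line without the dropped point

print(coef(m.all))
print(coef(m.trim))
```

abline() accepts a fitted lm object directly and reads the intercept and slope from it.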
Notice that, with respect to the original regression line, Texas has the biggest residual (difference in actual and predicted y), because Alaska pulled the line down toward itself. But really Alaska is the outlier! Next we’ll do a “jackknife” procedure to discover computationally that Alaska is the most important outlier.
Make a plot of the residuals for the original model. (Hint: they’re available in the output of lm().)
Notice again that the Texas residual is bigger than the Alaska residual.
Next use a loop to create n=50 models. In step i, make a model of the data with observation i removed. Then predict the value of y[i] from that model, and find the residual (difference) between (the removed) y[i] and the prediction. Save these residuals in a vector r.jack. (A “jackknife” procedure works by removing one observation (or several) from a data set, and then making a prediction from that smaller data set, and repeating this for each observation.)
Plot these “jackknife” residuals.
Notice now that Alaska is clearly the real outlier.
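The leave-one-out loop can be sketched on toy data as follows. The data are made up, and `r.full` is a hypothetical name for the ordinary residuals, included only for comparison.

```r
# Toy data: five points on a rising trend plus one high-leverage point.
toy <- data.frame(x = c(1, 2, 3, 4, 5, 20),
                  y = c(2.1, 3.9, 6.2, 8.1, 9.8, 5))
n <- nrow(toy)

r.jack <- numeric(n)
for (i in seq_len(n)) {
  m <- lm(y ~ x, data = toy[-i, ])                       # fit without observation i
  pred <- predict(m, newdata = toy[i, , drop = FALSE])   # predict the held-out y[i]
  r.jack[i] <- toy$y[i] - pred                           # jackknife residual
}

# Ordinary residuals from the model fit to all of the data, for comparison.
r.full <- resid(lm(y ~ x, data = toy))

plot(r.jack, xlab = "observation", ylab = "jackknife residual")
print(round(cbind(r.full, r.jack), 2))
```

In this toy example the sixth point drags the full-data line toward itself, so its ordinary residual is not the largest, while its jackknife residual is far larger than any other.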
Here we figure out which people produced the most movies in the IMDB Top Rated Movies list. (An example related to this search is the NFL web scraping code discussed in lecture.)
rm(list=ls())
First load the “XML” package to give access to readHTMLTable() and the “curl” package for access to curl().
if (!require("XML")) {
  install.packages("XML") # do this once per lifetime
  stopifnot(require("XML")) # do this once per session
}
if (!require("curl")) {
  install.packages("curl") # do this once per lifetime
  stopifnot(require("curl")) # do this once per session
}
At the bottom of the Internet Movie Database website there’s a link to the Top Rated Movies. At this page there’s a list of 250 movies, with a link to each movie. The first movie is The Shawshank Redemption.
With your browser on the “Top Rated Movies” page, you can do “right-click > view page source” (in Firefox or Chrome; in Safari, first do “Safari > Preferences > Advanced” and check “Show Develop menu in menu bar”; then do Develop > Show Page Source) to see the HTML code that creates the page. (You do not need to learn HTML for this homework.)
Search in the HTML source page for “Shawshank”, and you’ll see that it occurs twice, once in an <img .../> tag and once in an <a...</a> tag. Search for “Godfather”, and you’ll see that it occurs four times, twice for “The Godfather” and twice for “The Godfather: Part II”. For each of these three <a...</a> lines, the preceding line contains a link, relative to the main IMDB URL, to that movie’s page. Use grep() to figure out what small string is common to the 250 lines, like these three, that contain links to the top 250 movies.
Notice that the second line for “The Shawshank Redemption” includes the text “/title/tt0111161”. Pasting this onto “http://www.imdb.com” gives “http://www.imdb.com/title/tt0111161”, which is a link to the first movie’s page. Adding “/fullcredits” gives “http://www.imdb.com/title/tt0111161/fullcredits”, which is a link to the full cast and crew. Search this “fullcredits” page for “Produced” and you’ll see that “The Shawshank Redemption” was produced by “Liz Glotzer”, “David V. Lester”, and “Niki Marvin”.
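The string handling described above can be sketched without touching the network. The HTML lines below are hypothetical stand-ins, not IMDB’s real markup, so the pattern you need for the actual page may differ.

```r
# Hypothetical HTML lines (NOT the real IMDB markup).
lines <- c(
  '<td><a href="/title/tt0111161/?ref=x"><img alt="The Shawshank Redemption"/></a></td>',
  '<td><a href="/title/tt0111161/?ref=y">The Shawshank Redemption</a></td>',
  '<p>unrelated line</p>'
)

# Keep only the lines that mention a title link.
hits <- grep("/title/tt", lines, fixed = TRUE, value = TRUE)

# Pull the "/title/tt<digits>" piece out of each matching line.
paths <- regmatches(hits, regexpr("/title/tt[0-9]+", hits))
paths <- unique(paths)  # each movie should appear only once

# Build the movie URL and the full-credits URL.
movie.url   <- paste0("http://www.imdb.com", paths)
credits.url <- paste0(movie.url, "/fullcredits")
print(credits.url)
```

regexpr() finds the first match in each string, and regmatches() returns the matched text for the strings that had a match.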
Write code that does the following:
Use readLines() to read “http://www.imdb.com/chart/top” into a character string vector.
(See ?unique to remove the duplicate.)
Create producers = list().
Use readHTMLTable(readLines(curl())) to read all the tables into a list of dataframes; figure out which dataframe has the producers. You will need to replace “http” with “https” in each movie’s fullcredits URL (like “http://www.imdb.com/title/tt0111161/fullcredits”) to get readHTMLTable(readLines(curl())) to work.
Save producers[[title]] = ..., where ... is the vector of producers you found.
Use unlist(producers) to convert your list of title / producer vector pairs into a named vector of producers.
Use table() to make a table of counts from this vector.
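The last few steps (switching to https, filling the list, flattening it, and counting) can be sketched on made-up data. The movie titles and producer names below are placeholders, not real results.

```r
# Switch a fullcredits URL from http to https.
url <- "http://www.imdb.com/title/tt0111161/fullcredits"
url.https <- sub("^http:", "https:", url)

# Fill a list keyed by title (titles and names are placeholders).
producers <- list()
producers[["Movie A"]] <- c("Producer X", "Producer Y")
producers[["Movie B"]] <- c("Producer Y", "Producer Z")

# Flatten to one named character vector; names are derived from the titles.
all.producers <- unlist(producers)

# Count how often each producer appears, most frequent first.
counts <- sort(table(all.producers), decreasing = TRUE)
print(counts)
```

unlist() appends a running number to each title when it builds the names (for example “Movie A1”, “Movie A2”), which is why the values rather than the names are what table() counts.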