This document is a discussion around your attempt to read some bad data. You should have already tried to do that.

This document does not tell you how to read the data in. So, you should go do it first.

If you haven’t tried to read in the data, then reading ahead is like really unsatisfying cheating. I promise. It sucks. Seriously, go try getting that data into R. Try for at least 5 minutes.

Reflections

  1. What is your stance on reading in bad data?
  2. What errors did you find? What did you do when you found those errors?
  3. Did you “scroll through” the “raw data”? How?
  4. Did you download the file in your web browser or with R?

Troubleshooting

What are some key steps/elements/suggestions for importing difficult datasets?

How can you examine raw data files?

This section discusses three ways to examine the raw text file.

  1. simple text editor (for small files)
  2. readr::read_lines
  3. using the terminal

With a text editor

Sometimes, when I have trouble loading the data, I want to look at the data as a text file… to see what is going on.

When the data is small, you can just “open with” a simple text editor…


readr::read_lines

For large data sets, opening the file in a text editor can be very, very slow. In these cases, you just want to see a few lines of the data.

library(readr)

# download the file from within R
download.file("http://pages.stat.wisc.edu/~karlrohe/ds479/data/badRead.csv",
              "badData.csv")
read_lines("badData.csv", n_max = 10)  # just the first 10 lines
read_lines("badData.csv")              # every line of the file

# the old skool way is to form a file connection and use base::readLines:
pointer_to_file <- file("badData.csv")
pointer_to_file  # this is a weird thing!
readLines(pointer_to_file, n = 10)
close(pointer_to_file)  # remember to close the connection!

Using the terminal

Have you heard of the terminal (or bash or unix)?

In RStudio, you can access the terminal. In the same region of the screen where you execute R commands (i.e. the Console), there is another tab called Terminal.


Click on Terminal. You should see a command prompt.

Then, type pwd and hit return. You should see the file path of your “present working directory”.

Then, type ls and hit return. You should see a list of all the files in that directory. If you ran the code above (in R) to download badData.csv, then you should see that file listed there.

Type more badData.csv and hit return. It should fill the terminal screen with lines from the file. You can press return to scroll down one line and the space bar to get a full screen of new lines. Type q to escape at any time.

To view a specific line, sed -n '1001p' badData.csv will give you line 1001. Do this. Then, guess how to edit the code to look at line 1002. There is an important difference between these two lines!

If you’ve never used the terminal before, congratulations! You did it!

Optional: try head badData.csv and tail badData.csv. Or, try tail bad and before you hit return, hit tab. It should complete the file name. If not, hit tab twice and you will see all filenames in your present working directory that start with bad.
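
If you would rather stay in R, the same peek at two specific lines can be sketched with readr::read_lines, using skip to jump past the first 1000 lines and n_max to stop after two. The file below is a made-up stand-in (the name toy and its contents are assumptions for illustration), so swap in "badData.csv" to look at the real thing.

```r
library(readr)

# Toy file standing in for badData.csv (made up for this sketch)
toy <- tempfile(fileext = ".csv")
writeLines(paste0("row", 1:2000), toy)

# skip the first 1000 lines, then read the next two: lines 1001 and 1002
two_lines <- read_lines(toy, skip = 1000, n_max = 2)
two_lines
```

Printing the two lines next to each other makes it easier to compare them character by character.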

Karl’s workflow for downloading files…

I find readr::read_csv and data.table::fread to both be fast and pretty good at handling mildly annoying and messed up data. I often start with the first one. If you use data.table::fread, be aware that it returns a data.table rather than a tibble; if you want a tibble, convert it with tibble::as_tibble.
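
Here is a rough sketch of that two-step habit, assuming readr, data.table, and tibble are installed. The toy file and the object names (toy, dat_readr, dat_dt, dat_tbl) are made up for illustration; they are not part of the course data.

```r
library(readr)
library(data.table)
library(tibble)

# Toy file standing in for the real data (made up for this sketch)
toy <- tempfile(fileext = ".csv")
writeLines(c("id,x", "1,2.5", "2,oops", "3,4.0"), toy)

# first attempt: readr
dat_readr <- read_csv(toy)
problems(dat_readr)          # parsing failures, if readr recorded any

# second attempt: data.table, converted to a tibble afterwards
dat_dt <- fread(toy)
class(dat_dt)                # "data.table" "data.frame"
dat_tbl <- as_tibble(dat_dt)
class(dat_tbl)
```

Notice that the stray "oops" makes both readers treat x as text, which is exactly the kind of thing the validation step is meant to catch.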

Then, the hardest part is validation. How do you know that it read correctly? At the very least, ensure that these quantities are correct (assuming it is possible to check these things):
- number of rows and columns
- missing values
- variable types
- descriptive statistics / outliers / NA’s represented as 999 or -1
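
The checks in this list can be sketched in a few lines of base R. The data frame dat below is invented for illustration, and the choice of 999 and -1 as sentinel values is only an assumption taken from the list above; your data may hide its NA’s differently.

```r
# Toy data frame standing in for whatever you just read in (made up here)
dat <- data.frame(
  id    = 1:6,
  age   = c(34, 51, 999, 28, NA, 45),          # 999 is a suspicious sentinel
  score = c("7", "8", "n/a", "6", "9", "7"),   # read as text because of "n/a"
  stringsAsFactors = FALSE
)

dim(dat)                      # number of rows and columns
colSums(is.na(dat))           # missing values per column
sapply(dat, class)            # variable types
summary(dat$age)              # descriptive statistics; look for outliers

# count values that often stand in for NA
sentinels <- c(999, -1)
n_sentinel <- sapply(dat, function(col) sum(col %in% sentinels))
n_sentinel
```

A numeric column that shows up as character (like score here) is usually a hint that some cell holds text such as "n/a".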

Finally, if all else fails, write a function to read the data line by line. Examine specific features of each line (for example, the number of fields) to try to find the error.
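
One possible shape for such a function is sketched below. The helper find_odd_lines is hypothetical, it assumes the first line is a header, and it is deliberately naive: it only counts separators, so a quoted field that contains a comma will be flagged even though it is valid. It also pulls all lines into memory with readLines and then examines them one at a time, which is fine for moderately sized files.

```r
# Hypothetical helper: flag lines whose field count differs from the header's.
# Naive on purpose: it does not understand quoted fields that contain commas.
find_odd_lines <- function(path, sep = ",") {
  lines <- readLines(path, warn = FALSE)
  # number of separators on each line, plus one, is the number of fields
  n_fields <- lengths(regmatches(lines, gregexpr(sep, lines, fixed = TRUE))) + 1L
  expected <- n_fields[1]                      # assume line 1 is the header
  odd <- which(n_fields != expected)
  data.frame(line = odd, n_fields = n_fields[odd], text = lines[odd],
             stringsAsFactors = FALSE)
}

# Toy file standing in for the real data (made up for this sketch)
toy <- tempfile(fileext = ".csv")
writeLines(c("id,x,y", "1,2,3", "2,5", "3,7,8,9", "4,1,1"), toy)
odd_lines <- find_odd_lines(toy)
print(odd_lines)
```

The line numbers it returns can be fed straight back into sed -n or read_lines to look at the offending rows in context.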