This document is a discussion around your attempt to read some bad data. You should have already tried to do that.
This document does not tell you how to read the data in. So, you should go do it first.
If you haven’t tried to read in the data, then reading ahead is like really unsatisfying cheating. I promise. It sucks. Seriously, go try getting that data into R. Try for at least 5 minutes.
What are some key steps/elements/suggestions for importing difficult datasets?
This section discusses three ways to examine the raw text file.
readr::read_lines
Sometimes, when I have trouble loading the data, I want to look at the data as a text file… to see what is going on.
When the data is small, you can just “open with” a simple text editor…
For large data sets, that can be very very slow. In these cases, you just want to see a few lines of the data.
library(readr)
download.file("http://pages.stat.wisc.edu/~karlrohe/ds479/data/badRead.csv",
              "badData.csv")
read_lines("badData.csv", n_max = 10)  # only the first 10 lines
read_lines("badData.csv")              # every line; slow for big files
# the old skool way is to form a file connection and use base::readLines:
pointer_to_file <- file("badData.csv")
pointer_to_file # this is a weird thing!
readLines(pointer_to_file,n = 10)
close(pointer_to_file) # remember to close the connection!
Have you heard of the terminal (or bash or unix)?
In RStudio, you can access the terminal. In the same region of the screen where you execute R commands (i.e. the Console), there is another tab called Terminal.
Click on Terminal. You should see a command prompt waiting for input.
Then, type pwd and hit return. You should see the file path of your “present working directory”.
Then, type ls and hit return. You should see a list of all the files in that directory. If you ran the code above (in R) to download badData.csv, then you should see that file listed there.
Type more badData.csv and hit return. It should fill the terminal screen with lines from the file. You can press return to scroll down one line and the space bar to get a full screen of new lines. Type q to escape at any time.
To view specific lines, sed -n '1001p' badData.csv will give you line 1001. Do this. Then, guess how to edit the code to look at line 1002. There is an important difference between these two lines!
If you’ve never used the terminal before, congratulations! You did it!
Optional: try head badData.csv and tail badData.csv. Or, try typing tail bad and, before you hit return, hit tab. It should complete the file name. If not, hit tab twice and you will see all filenames in your present working directory that start with bad.
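If you would rather stay in the R console, the terminal commands above have rough R equivalents. This is only a sketch: the file at `path` is a made-up stand-in (2000 fake lines written to a temporary file) so the code runs on its own. In practice you would point `path` at badData.csv.

```r
library(readr)

# Hypothetical stand-in file so the sketch runs on its own;
# in practice, set path <- "badData.csv".
path <- tempfile(fileext = ".csv")
writeLines(paste0("row", 1:2000, ",", 1:2000), path)

getwd()                          # like pwd
list.files(pattern = "^bad")     # like ls, limited to names starting with "bad"

first_ten <- read_lines(path, n_max = 10)   # like head

all_lines <- read_lines(path)    # reads everything; slow for very large files
last_ten <- tail(all_lines, 10)  # like tail

# like sed -n '1001p': skip the first 1000 lines, then read one line
line_1001 <- read_lines(path, skip = 1000, n_max = 1)
line_1001
```

Using skip together with n_max means R never has to hold the whole file in memory, which is the same reason sed is handy for big files.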
I find readr::read_csv and data.table::fread to both be fast and pretty good at handling mildly annoying and messed up data. I often start with the first one. If you use data.table::fread, be aware that it will load the data in a funky form (a data.table) and you need to convert it to a tibble with as_tibble.
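Here is a small sketch of both readers side by side. The file is a hypothetical four-line stand-in written to a temporary location, not the course data; swap in "badData.csv" to try it on the real thing (where it may well complain, which is the point of the exercise).

```r
library(readr)
library(data.table)
library(tibble)

# Hypothetical stand-in file; in practice use "badData.csv".
path <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,3.5", "2,NA", "3,4.0"), path)

dat_readr <- read_csv(path)    # returns a tibble directly

dat_dt <- fread(path)          # returns a data.table
class(dat_dt)                  # shows that this is not a tibble

dat_tbl <- as_tibble(dat_dt)   # convert to a tibble
dat_tbl
```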
Then, the hardest part is validation. How do you know that it read correctly? At the very least, ensure that these quantities are correct (assuming it is possible to check these things):
- number of rows and columns
- missing values
- variable types
- descriptive statistics / outliers / NA’s represented as 999 or -1
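The checks in this list can be sketched in a few lines of base R. The data frame `dat` below is invented for illustration (it is not the course data), and the sentinel values 999 and -1 are just the ones named above; your data may use others.

```r
# Hypothetical stand-in for a freshly imported data set
dat <- data.frame(
  id = 1:5,
  age = c(34, 999, 51, NA, 28),
  income = c("52000", "61000", "n/a", "48000", "-1"),
  stringsAsFactors = FALSE
)

dim(dat)              # number of rows and columns
colSums(is.na(dat))   # missing values per column
sapply(dat, class)    # variable types; a character "income" is a red flag
summary(dat)          # descriptive statistics; a max age of 999 stands out

# count suspicious sentinel values in each column
sentinels <- c(999, -1)
sentinel_counts <- sapply(dat, function(col) sum(col %in% sentinels))
sentinel_counts
```

When possible, compare the output of dim against something you know independently, such as the number of lines in the raw file.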
Finally, if all else fails, write a function to read the data line by line. Examine specific features of each line to try to find the error.
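One way this could look, as a rough sketch: read one line at a time, count the separators, and report the lines whose field count differs from the most common one. The function name `find_odd_lines` is made up for this example, and the counting is deliberately naive (it ignores quoting, so a comma inside a quoted field is counted as a separator).

```r
# Hypothetical helper: flag lines whose number of fields is unusual.
find_odd_lines <- function(path, sep = ",") {
  con <- file(path, open = "r")
  on.exit(close(con))          # always close the connection
  n_fields <- integer(0)
  repeat {
    line <- readLines(con, n = 1, warn = FALSE)
    if (length(line) == 0) break   # end of file
    n_sep <- lengths(regmatches(line, gregexpr(sep, line, fixed = TRUE)))
    n_fields <- c(n_fields, n_sep + 1L)
  }
  # treat the most common field count as the expected one
  expected <- as.integer(names(which.max(table(n_fields))))
  odd <- n_fields != expected
  data.frame(line_number = which(odd), n_fields = n_fields[odd])
}

# Usage on a small made-up file where line 3 is missing a field
path <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1,2,3", "4,5", "6,7,8"), path)
odd_lines <- find_odd_lines(path)
odd_lines
```

Growing `n_fields` inside the loop is slow for very large files; for those, reading in chunks (a larger n in readLines) would be a reasonable next step.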