read.csv and readr::read_csv
write.csv and readr::write_csv
readxl::read_excel
readRDS and saveRDS
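As a minimal sketch of how the base-R members of these pairs behave, the following writes a small data frame to temporary files and reads it back. The data frame `toy` and the temporary file names are made up for illustration; they are not part of the classroom data used later.

```r
# Toy data frame, made up for illustration (not the classroom data).
toy <- data.frame(id = 1:3, grp = factor(c("a", "b", "a")))

csv_file <- tempfile(fileext = ".csv")  # temporary file paths
rds_file <- tempfile(fileext = ".rds")

# Text round trip: with stringsAsFactors = FALSE the factor column
# comes back as plain character strings.
write.csv(toy, csv_file, row.names = FALSE)
from_csv <- read.csv(csv_file, stringsAsFactors = FALSE)

# Binary round trip: the R object is restored with its column types.
saveRDS(toy, rds_file)
from_rds <- readRDS(rds_file)

is.factor(from_csv$grp)
is.factor(from_rds$grp)
```

A CSV file stores only text, so type information such as factor levels has to be re-established after reading, whereas an RDS file stores the R object itself.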
One can give a URL instead of a file name as an argument to functions such as read.csv and read.delim. Consider the data at http://www-personal.umich.edu/~bwest/classroom.csv
str(class <- read.csv("http://www-personal.umich.edu/~bwest/classroom.csv"))
## 'data.frame': 1190 obs. of 12 variables:
## $ sex : int 1 0 1 0 0 1 0 0 1 0 ...
## $ minority: int 1 1 1 1 1 1 1 1 1 1 ...
## $ mathkind: int 448 460 511 449 425 450 452 443 422 480 ...
## $ mathgain: int 32 109 56 83 53 65 51 66 88 -7 ...
## $ ses : num 0.46 -0.27 -0.03 -0.38 -0.03 0.76 -0.03 0.2 0.64 0.13 ...
## $ yearstea: num 1 1 1 2 2 2 2 2 2 2 ...
## $ mathknow: num NA NA NA -0.11 -0.11 -0.11 -0.11 -0.11 -0.11 -0.11 ...
## $ housepov: num 0.082 0.082 0.082 0.082 0.082 0.082 0.082 0.082 0.082 0.082 ...
## $ mathprep: num 2 2 2 3.25 3.25 3.25 3.25 3.25 3.25 3.25 ...
## $ classid : int 160 160 160 217 217 217 217 217 217 217 ...
## $ schoolid: int 1 1 1 1 1 1 1 1 1 1 ...
## $ childid : int 1 2 3 4 5 6 7 8 9 10 ...
Data sets like this use artificial numeric coding of variables that are in fact categorical. If we summarize these data
summary(class)
## sex minority mathkind mathgain
## Min. :0.0000 Min. :0.0000 Min. :290.0 Min. :-110.00
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:439.2 1st Qu.: 35.00
## Median :1.0000 Median :1.0000 Median :466.0 Median : 56.00
## Mean :0.5059 Mean :0.6773 Mean :466.7 Mean : 57.57
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:495.0 3rd Qu.: 77.00
## Max. :1.0000 Max. :1.0000 Max. :629.0 Max. : 253.00
##
## ses yearstea mathknow housepov
## Min. :-1.61000 Min. : 0.00 Min. :-2.5000 Min. :0.0120
## 1st Qu.:-0.49000 1st Qu.: 4.00 1st Qu.:-0.7200 1st Qu.:0.0850
## Median :-0.03000 Median :10.00 Median :-0.1300 Median :0.1270
## Mean :-0.01298 Mean :12.21 Mean : 0.0312 Mean :0.1782
## 3rd Qu.: 0.39750 3rd Qu.:20.00 3rd Qu.: 0.8500 3rd Qu.:0.2550
## Max. : 3.21000 Max. :40.00 Max. : 2.6100 Max. :0.5640
## NA's :109
## mathprep classid schoolid childid
## Min. :1.000 Min. : 1.0 Min. : 1.00 Min. : 1.0
## 1st Qu.:2.000 1st Qu.: 80.0 1st Qu.: 26.00 1st Qu.: 298.2
## Median :2.300 Median :157.0 Median : 54.00 Median : 595.5
## Mean :2.612 Mean :157.5 Mean : 52.94 Mean : 595.5
## 3rd Qu.:3.000 3rd Qu.:238.8 3rd Qu.: 79.00 3rd Qu.: 892.8
## Max. :6.000 Max. :312.0 Max. :107.00 Max. :1190.0
##
we get nonsensical numerical summaries of characteristics like sex. We should change these variables to factors.
class <- within(class, {
    sex      <- factor(sex, labels = c("M", "F"))
    minority <- factor(minority, labels = c("N", "Y"))
    classid  <- factor(classid)
    schoolid <- factor(schoolid)
    childid  <- factor(childid)
})
str(class)
## 'data.frame': 1190 obs. of 12 variables:
## $ sex : Factor w/ 2 levels "M","F": 2 1 2 1 1 2 1 1 2 1 ...
## $ minority: Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
## $ mathkind: int 448 460 511 449 425 450 452 443 422 480 ...
## $ mathgain: int 32 109 56 83 53 65 51 66 88 -7 ...
## $ ses : num 0.46 -0.27 -0.03 -0.38 -0.03 0.76 -0.03 0.2 0.64 0.13 ...
## $ yearstea: num 1 1 1 2 2 2 2 2 2 2 ...
## $ mathknow: num NA NA NA -0.11 -0.11 -0.11 -0.11 -0.11 -0.11 -0.11 ...
## $ housepov: num 0.082 0.082 0.082 0.082 0.082 0.082 0.082 0.082 0.082 0.082 ...
## $ mathprep: num 2 2 2 3.25 3.25 3.25 3.25 3.25 3.25 3.25 ...
## $ classid : Factor w/ 312 levels "1","2","3","4",..: 160 160 160 217 217 217 217 217 217 217 ...
## $ schoolid: Factor w/ 107 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ childid : Factor w/ 1190 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
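The factor() calls above rely on the labels being matched to the sorted unique codes, so that 0 becomes the first level and 1 the second. A small sketch, using a made-up vector `codes` rather than the classroom data, shows the same conversion with the mapping spelled out through the levels argument:

```r
# Made-up 0/1 codes standing in for a column such as sex.
codes <- c(1, 0, 1, 0, 0)

# Implicit mapping: labels are matched to the sorted unique codes (0, then 1).
implicit <- factor(codes, labels = c("M", "F"))

# Explicit mapping: state which code receives which label.
explicit <- factor(codes, levels = c(0, 1), labels = c("M", "F"))

all(implicit == explicit)
table(explicit)
```

Naming the levels explicitly is a little more verbose, but it guards against a code that happens to be absent from the data shifting the labels.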
summary(class)
## sex minority mathkind mathgain ses
## M:588 N:384 Min. :290.0 Min. :-110.00 Min. :-1.61000
## F:602 Y:806 1st Qu.:439.2 1st Qu.: 35.00 1st Qu.:-0.49000
## Median :466.0 Median : 56.00 Median :-0.03000
## Mean :466.7 Mean : 57.57 Mean :-0.01298
## 3rd Qu.:495.0 3rd Qu.: 77.00 3rd Qu.: 0.39750
## Max. :629.0 Max. : 253.00 Max. : 3.21000
##
## yearstea mathknow housepov mathprep
## Min. : 0.00 Min. :-2.5000 Min. :0.0120 Min. :1.000
## 1st Qu.: 4.00 1st Qu.:-0.7200 1st Qu.:0.0850 1st Qu.:2.000
## Median :10.00 Median :-0.1300 Median :0.1270 Median :2.300
## Mean :12.21 Mean : 0.0312 Mean :0.1782 Mean :2.612
## 3rd Qu.:20.00 3rd Qu.: 0.8500 3rd Qu.:0.2550 3rd Qu.:3.000
## Max. :40.00 Max. : 2.6100 Max. :0.5640 Max. :6.000
## NA's :109
## classid schoolid childid
## 26 : 10 11 : 31 1 : 1
## 42 : 10 12 : 27 2 : 1
## 13 : 9 71 : 27 3 : 1
## 189 : 9 76 : 27 4 : 1
## 205 : 9 77 : 24 5 : 1
## 253 : 9 31 : 22 6 : 1
## (Other):1134 (Other):1032 (Other):1184
The childid variable is redundant but there is no harm in retaining it.
For a categorical variable the summary is a frequency table. If the number of levels is large, only the levels with the largest counts are shown, in decreasing order of count, and the rest are pooled as (Other). Thus the largest number of students sampled from a single class is 10. To look at the distribution of the counts we can apply xtabs twice.
xtabs(~xtabs(~classid, class))
## xtabs(~classid, class)
## 1 2 3 4 5 6 7 8 9 10
## 42 53 53 61 39 31 14 13 4 2
Of the 312 classrooms, 42 contribute only one student to the study. That is a thin basis for a study whose purpose is to determine the effects of teacher training on student performance.
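The nested xtabs call can be easier to follow on a tiny example. In this sketch the data frame `toy` is made up for illustration: the inner call counts students per class, and the outer call counts how many classes have each size.

```r
# Made-up data: classes "a" and "b" have two students each, class "c" has one.
toy <- data.frame(classid = factor(c("a", "a", "b", "b", "c")))

# Inner tabulation: number of students in each class.
per_class <- xtabs(~classid, toy)

# Outer tabulation: number of classes having each class size.
size_dist <- xtabs(~xtabs(~classid, toy))

# The same idea written with table().
alt <- table(table(toy$classid))

per_class
size_dist
```

Here one class has a single student and two classes have two students, which is the toy analogue of the 42 single-student classrooms above.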
Many of the variables are characteristics of the teacher or of the school and should therefore be constant within a class. We should check that this is true.
str(classvars <- unique(subset(class,select=c("yearstea","mathknow","housepov","mathprep","classid","schoolid"))))
## 'data.frame': 312 obs. of 6 variables:
## $ yearstea: num 1 2 1 2 12.5 ...
## $ mathknow: num NA -0.11 -1.25 -0.72 NA 0.45 0.99 1.61 1.14 -1.05 ...
## $ housepov: num 0.082 0.082 0.082 0.082 0.082 0.086 0.086 0.086 0.086 0.365 ...
## $ mathprep: num 2 3.25 2.5 2.33 2.3 3.83 2.25 3 2.17 2 ...
## $ classid : Factor w/ 312 levels "1","2","3","4",..: 160 217 197 211 307 11 137 145 228 48 ...
## $ schoolid: Factor w/ 107 levels "1","2","3","4",..: 1 1 2 2 2 3 3 3 3 4 ...
summary(classvars)
## yearstea mathknow housepov mathprep
## Min. : 0.00 Min. :-2.50000 Min. :0.0120 Min. :1.000
## 1st Qu.: 4.00 1st Qu.:-0.76000 1st Qu.:0.0850 1st Qu.:2.000
## Median :10.00 Median :-0.19000 Median :0.1420 Median :2.300
## Mean :12.28 Mean :-0.08025 Mean :0.1908 Mean :2.577
## 3rd Qu.:20.00 3rd Qu.: 0.62000 3rd Qu.:0.2630 3rd Qu.:3.000
## Max. :40.00 Max. : 2.61000 Max. :0.5640 Max. :6.000
## NA's :27
## classid schoolid
## 1 : 1 11 : 9
## 2 : 1 12 : 5
## 3 : 1 15 : 5
## 4 : 1 17 : 5
## 5 : 1 33 : 5
## 6 : 1 46 : 5
## (Other):306 (Other):278
xtabs(~xtabs(~schoolid,classvars))
## xtabs(~schoolid, classvars)
## 1 2 3 4 5 9
## 13 34 26 21 12 1
The important information here is that classvars has 312 rows, one for each of the 312 classes. If any of the other variables were not constant within a class, unique would have returned more than 312 rows.
We also see that the number of classes sampled per school is highly unbalanced: 47 of the 107 schools (13 + 34) have only one or two classes sampled.
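The row-count argument can be sketched on a small made-up data frame in which one class deliberately has inconsistent values. The names `toy` and `rows` are illustrative only; the column names are borrowed from the classroom data.

```r
# Made-up data: class 2 has two different mathprep values.
toy <- data.frame(
    classid  = factor(c(1, 1, 2, 2, 3)),
    yearstea = c(5, 5, 2, 2, 9),
    mathprep = c(2, 2, 3, 3.5, 1)
)

# Keep one copy of each distinct row.
rows <- unique(toy)

# Consistent data would give exactly one row per class.
nrow(rows) == nlevels(toy$classid)

# Classes that appear more than once are the inconsistent ones.
names(which(table(rows$classid) > 1))
```

With consistent data the comparison is TRUE and the last expression returns an empty character vector; here it singles out class 2.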
A check on the school-specific variable, housepov, shows that it is consistent: the result has 107 rows, one for each of the 107 schools.
str(schoolvars <- unique(subset(classvars,select=c("housepov","schoolid"))))
## 'data.frame': 107 obs. of 2 variables:
## $ housepov: num 0.082 0.082 0.086 0.365 0.511 0.044 0.148 0.085 0.537 0.346 ...
## $ schoolid: Factor w/ 107 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
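An alternative to counting unique rows is to count the distinct values within each group directly. The following sketch does this with tapply on a made-up data frame; `toy` and `n_values` are illustrative names, not objects from the analysis above.

```r
# Made-up data: three schools, each with a single housepov value.
toy <- data.frame(
    schoolid = factor(c(1, 1, 2, 2, 3)),
    housepov = c(0.08, 0.08, 0.36, 0.36, 0.51)
)

# Number of distinct housepov values within each school.
n_values <- tapply(toy$housepov, toy$schoolid,
                   function(x) length(unique(x)))

# TRUE when every school has exactly one value.
all(n_values == 1)
```

This form reports which groups are inconsistent rather than only whether the total row count matches, at the cost of checking one variable at a time.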