Reading data over the Internet

One can give a URL instead of a file name as the file argument to functions such as read.csv and read.delim. Consider the data at http://www-personal.umich.edu/~bwest/classroom.csv, which we read into a data frame and inspect with str.

str(class <- read.csv("http://www-personal.umich.edu/~bwest/classroom.csv"))
## 'data.frame':    1190 obs. of  12 variables:
##  $ sex     : int  1 0 1 0 0 1 0 0 1 0 ...
##  $ minority: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ mathkind: int  448 460 511 449 425 450 452 443 422 480 ...
##  $ mathgain: int  32 109 56 83 53 65 51 66 88 -7 ...
##  $ ses     : num  0.46 -0.27 -0.03 -0.38 -0.03 0.76 -0.03 0.2 0.64 0.13 ...
##  $ yearstea: num  1 1 1 2 2 2 2 2 2 2 ...
##  $ mathknow: num  NA NA NA -0.11 -0.11 -0.11 -0.11 -0.11 -0.11 -0.11 ...
##  $ housepov: num  0.082 0.082 0.082 0.082 0.082 0.082 0.082 0.082 0.082 0.082 ...
##  $ mathprep: num  2 2 2 3.25 3.25 3.25 3.25 3.25 3.25 3.25 ...
##  $ classid : int  160 160 160 217 217 217 217 217 217 217 ...
##  $ schoolid: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ childid : int  1 2 3 4 5 6 7 8 9 10 ...
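
Because a remote file can move or become temporarily unavailable, it can help to keep a local copy. The following is a minimal sketch, not part of the original analysis: the helper name read_csv_cached and the cache file name are hypothetical, while file.exists, download.file and read.csv are standard R functions.

```r
# Hypothetical helper: download the file only if no local copy exists yet,
# then read the local copy with read.csv.
read_csv_cached <- function(url, cache_file) {
  if (!file.exists(cache_file)) {
    download.file(url, destfile = cache_file, mode = "wb", quiet = TRUE)
  }
  read.csv(cache_file)
}

# Usage (needs network access on the first call only; file name is an assumption):
# class <- read_csv_cached("http://www-personal.umich.edu/~bwest/classroom.csv",
#                          "classroom.csv")
```

Later calls read the cached file, so the analysis can be rerun without network access.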

Data sets like this one often use arbitrary numeric codes for variables that are in fact categorical. If we summarize these data

summary(class)
##       sex            minority         mathkind        mathgain      
##  Min.   :0.0000   Min.   :0.0000   Min.   :290.0   Min.   :-110.00  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:439.2   1st Qu.:  35.00  
##  Median :1.0000   Median :1.0000   Median :466.0   Median :  56.00  
##  Mean   :0.5059   Mean   :0.6773   Mean   :466.7   Mean   :  57.57  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:495.0   3rd Qu.:  77.00  
##  Max.   :1.0000   Max.   :1.0000   Max.   :629.0   Max.   : 253.00  
##                                                                     
##       ses              yearstea        mathknow          housepov     
##  Min.   :-1.61000   Min.   : 0.00   Min.   :-2.5000   Min.   :0.0120  
##  1st Qu.:-0.49000   1st Qu.: 4.00   1st Qu.:-0.7200   1st Qu.:0.0850  
##  Median :-0.03000   Median :10.00   Median :-0.1300   Median :0.1270  
##  Mean   :-0.01298   Mean   :12.21   Mean   : 0.0312   Mean   :0.1782  
##  3rd Qu.: 0.39750   3rd Qu.:20.00   3rd Qu.: 0.8500   3rd Qu.:0.2550  
##  Max.   : 3.21000   Max.   :40.00   Max.   : 2.6100   Max.   :0.5640  
##                                     NA's   :109                       
##     mathprep        classid         schoolid         childid      
##  Min.   :1.000   Min.   :  1.0   Min.   :  1.00   Min.   :   1.0  
##  1st Qu.:2.000   1st Qu.: 80.0   1st Qu.: 26.00   1st Qu.: 298.2  
##  Median :2.300   Median :157.0   Median : 54.00   Median : 595.5  
##  Mean   :2.612   Mean   :157.5   Mean   : 52.94   Mean   : 595.5  
##  3rd Qu.:3.000   3rd Qu.:238.8   3rd Qu.: 79.00   3rd Qu.: 892.8  
##  Max.   :6.000   Max.   :312.0   Max.   :107.00   Max.   :1190.0  
## 

we get nonsensical numerical summaries of characteristics like sex and of identifiers like classid, schoolid and childid. We should convert these variables to factors.

class <- within(class,{
  sex <- factor(sex,labels=c("M","F"))
  minority <- factor(minority,labels=c("N","Y"))
  classid <- factor(classid)
  schoolid <- factor(schoolid)
  childid <- factor(childid)
})
str(class)
## 'data.frame':    1190 obs. of  12 variables:
##  $ sex     : Factor w/ 2 levels "M","F": 2 1 2 1 1 2 1 1 2 1 ...
##  $ minority: Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mathkind: int  448 460 511 449 425 450 452 443 422 480 ...
##  $ mathgain: int  32 109 56 83 53 65 51 66 88 -7 ...
##  $ ses     : num  0.46 -0.27 -0.03 -0.38 -0.03 0.76 -0.03 0.2 0.64 0.13 ...
##  $ yearstea: num  1 1 1 2 2 2 2 2 2 2 ...
##  $ mathknow: num  NA NA NA -0.11 -0.11 -0.11 -0.11 -0.11 -0.11 -0.11 ...
##  $ housepov: num  0.082 0.082 0.082 0.082 0.082 0.082 0.082 0.082 0.082 0.082 ...
##  $ mathprep: num  2 2 2 3.25 3.25 3.25 3.25 3.25 3.25 3.25 ...
##  $ classid : Factor w/ 312 levels "1","2","3","4",..: 160 160 160 217 217 217 217 217 217 217 ...
##  $ schoolid: Factor w/ 107 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ childid : Factor w/ 1190 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
summary(class)
##  sex     minority    mathkind        mathgain            ses          
##  M:588   N:384    Min.   :290.0   Min.   :-110.00   Min.   :-1.61000  
##  F:602   Y:806    1st Qu.:439.2   1st Qu.:  35.00   1st Qu.:-0.49000  
##                   Median :466.0   Median :  56.00   Median :-0.03000  
##                   Mean   :466.7   Mean   :  57.57   Mean   :-0.01298  
##                   3rd Qu.:495.0   3rd Qu.:  77.00   3rd Qu.: 0.39750  
##                   Max.   :629.0   Max.   : 253.00   Max.   : 3.21000  
##                                                                       
##     yearstea        mathknow          housepov         mathprep    
##  Min.   : 0.00   Min.   :-2.5000   Min.   :0.0120   Min.   :1.000  
##  1st Qu.: 4.00   1st Qu.:-0.7200   1st Qu.:0.0850   1st Qu.:2.000  
##  Median :10.00   Median :-0.1300   Median :0.1270   Median :2.300  
##  Mean   :12.21   Mean   : 0.0312   Mean   :0.1782   Mean   :2.612  
##  3rd Qu.:20.00   3rd Qu.: 0.8500   3rd Qu.:0.2550   3rd Qu.:3.000  
##  Max.   :40.00   Max.   : 2.6100   Max.   :0.5640   Max.   :6.000  
##                  NA's   :109                                       
##     classid        schoolid       childid    
##  26     :  10   11     :  31   1      :   1  
##  42     :  10   12     :  27   2      :   1  
##  13     :   9   71     :  27   3      :   1  
##  189    :   9   76     :  27   4      :   1  
##  205    :   9   77     :  24   5      :   1  
##  253    :   9   31     :  22   6      :   1  
##  (Other):1134   (Other):1032   (Other):1184

The childid variable is redundant, because each of its 1190 values identifies a single row, but there is no harm in retaining it.
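
When labels are supplied to factor without levels, they are matched to the sorted unique values of the codes, so 0 receives the first label and 1 the second. A small sketch on made-up codes (not the classroom data) shows this, along with the more explicit form that states the levels as well. The vector name sex_code and its values are assumptions for illustration.

```r
# Toy 0/1 codes, not the classroom data
sex_code <- c(1, 0, 1, 0, 0)

# Without 'levels', labels are matched to the sorted unique codes (0, then 1)
f1 <- factor(sex_code, labels = c("M", "F"))

# Stating 'levels' makes the code-to-label mapping explicit
f2 <- factor(sex_code, levels = c(0, 1), labels = c("M", "F"))

all(f1 == f2)  # both forms give the same factor here
table(f2)      # frequency table, as summary() reports for a factor
```

The explicit form is more robust when a code happens to be absent from the data, since the labels no longer depend on which values occur.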

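For a categorical variable the summary is a frequency table. If the number of levels is large, only the levels with the largest counts are listed, in decreasing order of count, and the remaining levels are pooled as (Other). Thus the largest number of students sampled from a single class is 10. To look at the distribution of the counts we can apply xtabs twice: the inner call counts the students in each class and the outer call tabulates those counts.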

xtabs(~xtabs(~classid, class))
## xtabs(~classid, class)
##  1  2  3  4  5  6  7  8  9 10 
## 42 53 53 61 39 31 14 13  4  2

Of the 312 classrooms, 42 contribute only one student to the study, whose purpose is to determine the effects of teacher training on student performance.
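
The "table of a table" idea is easier to follow on a tiny made-up example. The sketch below uses table, the base R counterpart of xtabs for a single vector; the vector g and its values are assumptions for illustration only.

```r
# Toy group labels: group a has 3 members, b has 2, c has 1, d has 2
g <- c("a", "a", "a", "b", "b", "c", "d", "d")

# Inner tabulation: number of members in each group
group_size <- table(g)

# Outer tabulation: how many groups have each size
size_dist <- table(group_size)
size_dist
```

Here one group has size 1, two groups have size 2 and one group has size 3, which is the same kind of information the nested xtabs call gives for students per class.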

Class-specific and school-specific variables

Several of the variables are characteristics of the teacher (yearstea, mathknow, mathprep) or of the school (housepov) and should be constant within a class. We should check that this is true.

str(classvars <- unique(subset(class,select=c("yearstea","mathknow","housepov","mathprep","classid","schoolid"))))
## 'data.frame':    312 obs. of  6 variables:
##  $ yearstea: num  1 2 1 2 12.5 ...
##  $ mathknow: num  NA -0.11 -1.25 -0.72 NA 0.45 0.99 1.61 1.14 -1.05 ...
##  $ housepov: num  0.082 0.082 0.082 0.082 0.082 0.086 0.086 0.086 0.086 0.365 ...
##  $ mathprep: num  2 3.25 2.5 2.33 2.3 3.83 2.25 3 2.17 2 ...
##  $ classid : Factor w/ 312 levels "1","2","3","4",..: 160 217 197 211 307 11 137 145 228 48 ...
##  $ schoolid: Factor w/ 107 levels "1","2","3","4",..: 1 1 2 2 2 3 3 3 3 4 ...
summary(classvars)
##     yearstea        mathknow           housepov         mathprep    
##  Min.   : 0.00   Min.   :-2.50000   Min.   :0.0120   Min.   :1.000  
##  1st Qu.: 4.00   1st Qu.:-0.76000   1st Qu.:0.0850   1st Qu.:2.000  
##  Median :10.00   Median :-0.19000   Median :0.1420   Median :2.300  
##  Mean   :12.28   Mean   :-0.08025   Mean   :0.1908   Mean   :2.577  
##  3rd Qu.:20.00   3rd Qu.: 0.62000   3rd Qu.:0.2630   3rd Qu.:3.000  
##  Max.   :40.00   Max.   : 2.61000   Max.   :0.5640   Max.   :6.000  
##                  NA's   :27                                         
##     classid       schoolid  
##  1      :  1   11     :  9  
##  2      :  1   12     :  5  
##  3      :  1   15     :  5  
##  4      :  1   17     :  5  
##  5      :  1   33     :  5  
##  6      :  1   46     :  5  
##  (Other):306   (Other):278
xtabs(~xtabs(~schoolid,classvars))
## xtabs(~schoolid, classvars)
##  1  2  3  4  5  9 
## 13 34 26 21 12  1

The important information, from the str output, is that there are 312 rows in this data frame, one for each of the 312 classes. If any of the selected variables were not constant within a class, unique would return more than 312 rows.
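
The logic of this check can be seen on a small made-up data frame. This is only a sketch: the data frames toy and toy_bad and their values are hypothetical, and only the variable names are borrowed from the classroom data.

```r
# Toy data: 3 classes, teacher experience constant within each class
toy <- data.frame(
  classid  = factor(c(1, 1, 2, 2, 3)),
  yearstea = c(5, 5, 10, 10, 2)
)
per_class <- unique(subset(toy, select = c("yearstea", "classid")))
nrow(per_class) == nlevels(toy$classid)   # one row per class: consistent

# Introduce an inconsistency in class 1
toy_bad <- toy
toy_bad$yearstea[2] <- 6
per_class_bad <- unique(subset(toy_bad, select = c("yearstea", "classid")))
nrow(per_class_bad) > nlevels(toy_bad$classid)  # extra row reveals the problem
```

Comparing the row count with nlevels of the grouping factor turns the visual check into a test that could be placed inside stopifnot.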

The final table shows that the number of classes sampled per school is highly unbalanced: 47 of the 107 schools have only one or two classes sampled, while one school has nine.

A similar check on the school-specific variable, housepov, shows that it is consistent: the result has 107 rows, one for each school.

str(schoolvars <- unique(subset(classvars,select=c("housepov","schoolid"))))
## 'data.frame':    107 obs. of  2 variables:
##  $ housepov: num  0.082 0.082 0.086 0.365 0.511 0.044 0.148 0.085 0.537 0.346 ...
##  $ schoolid: Factor w/ 107 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
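
Had the row count exceeded the number of schools, the next step would be to find which schools are inconsistent. One possible approach, sketched here on made-up data, counts the distinct values within each group using tapply. The data frame d and its values are hypothetical.

```r
# Toy data: school "b" has two different housepov values
d <- data.frame(
  schoolid = factor(c("a", "a", "b", "b", "c")),
  housepov = c(0.1, 0.1, 0.2, 0.3, 0.4)
)

# Number of distinct housepov values within each school
n_distinct <- tapply(d$housepov, d$schoolid, function(v) length(unique(v)))

# Schools where the supposedly constant variable varies
inconsistent <- names(n_distinct)[n_distinct > 1]
inconsistent
```

An empty result would confirm what the row count of the unique data frame already suggests.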