Clustering the graph with hierarchical clustering.

First, load the data and construct the KNN graph. This code is hidden in the .html document. To see this code, change the url of the current page by replacing “.html” with “.Rmd”

The next line of R code loads a file contains a function globalTransCASC that (1) takes the graph sum like CASC, then (2) weights each edge by the number of triangles that it joins, and finally (3) runs single linkage hierarchical clustering on the resulting similarity matrix.

source(url("http://pages.stat.wisc.edu/~karlrohe/netsci/code/globalTransitivityFunctions.R"))

Single linkage is often stated as taking pairwise distances between points. Because a graph encodes “similarities”, the globalTrans algorithm instead works with similarities (akin to the inverse of distance). This also yields a massive computational advantage because it is equivalent to computing a maximum spanning tree (same base algorithm as a minimum spanning tree). Like an eigendecomposition, computing the mst is a super well studied algorithmic problem. There are two popular algorithms for computing the mst of a weighted undirected graph, Prim’s and Kruskal’s ( wikipedia is also suggesting another one that I am not familiar with ) There is native code in igraph that runs Prim’s and is so so fast. In my experience with sparse adjacency matrices, sparse multiplication \(AA\) takes much much longer computing the mst. If you can ever relate your problem to mst’s, do it.

The “globalTrans” algorithm was proposed and studied in this paper. The original form does not have the “graph sum” part; that is a “hack” that has been incorporated from the CASC paper. For computing a fine partition of the graph, globalTrans is much faster than an eigen decomposition (even when done with a sparse solver), because eigs needs to find hundreds of leading eigenvectors… this take a lot of work. “globalTrans” is an aglomerative or “bottom up” approach to clustering. Because it relies on minumum spanning trees, it provides a hierarchy of clusters!

gt = globalTransCASC(A,Aknn)
## Warning: closing unused connection 6
## (http://pages.stat.wisc.edu/~karlrohe/netsci/code/globalTransitivityFunctions.R)
diagnose(gt,minSize = 10)

##    36.50664% 
## 0.0005646919
library(igraph)
cutval= diagnose(gt, cutSeq = seq(0.0005, .0007,len = 30))

clust = getClusterEst(gt, cutval)

This algorithm doesn’t mind returning lots of small clusters. This can be annoying. Roughly 10% of the nodes are in clusters that contain 5 or fewer nodes.

tab = table(clust)
sum(tab[tab<6])/length(V(gt))
## [1] 0.09175162

Remove clusters smaller than 20. Then, find zipcode locations and plot.

newClust = chopSmallClusters(clust,minSize = 10)
names(newClust) = names(clust)
newClust = newClust[complete.cases(newClust)]

loc = zipcode[names(newClust), c("longitude", "latitude"), with  =F]
loc = as.matrix(loc)
mean(complete.cases(loc))  # all complete
## [1] 1

Given that we are making lots of “tessellation” maps, let’s make this a function.

makeMap = function(loc, zhat, dots = '.'){
  points(rnorm(loc[,1],loc[,1],.1),rnorm(loc[,2],loc[,2],.1), col=as.factor(zhat), pch  =dots)
  
  for(i in unique(zhat)){
    inclust = which(zhat ==i)
    Plot_ConvexHull(xcoord = loc[inclust,1], ycoord = loc[inclust,2],lcolor = "black", thick = 2)
    }
  }

map('state', add = F) 
makeMap(loc, newClust)

These clusters are more geographically unbalanced. The number of nodes is also unbalanced.

plot(as.vector(sort(table(newClust))))

length(unique(newClust))
## [1] 322

While it fixes the crazy geography clusters… the clusters do not align with my expectations of partitions of the united states. What do you think?

map('county', 'wi')
makeMap(loc,newClust, dots  = 19)

map('county', c('wash','ore'))
makeMap(loc,newClust, dots  = 19)

map('county', 'cali')
makeMap(loc,newClust, dots  = 19)

NorthEast = c("Connecticut", "Maine", "Massachusetts", "New Hampshire", "Rhode Island", "Vermont", "New Jersey", "New York", "Pennsylvania", "Mary", "Del")
map('county', NorthEast)
makeMap(loc,newClust, dots  = '')

map('county', c('new york', 'penn'))
makeMap(loc,newClust, dots  = 21)

map('county', c('verm', 'new ham'))
makeMap(loc,newClust, dots  = 21)