In the first uses of the SBM, the block memberships \(z\) were known a priori. Later, \(z\) was estimated in the “a posteriori SBM”. In my reading of the literature, Newman and Girvan really pushed this idea further by saying that the “blocks” are modules. You can think of the \(B\) matrix as an adjacency matrix for a “modular” network; here, each block becomes a “supernode”.
It is reasonable to model \(B\) using known covariates as \(z\). Then consider \(\hat B\) as a weighted adjacency matrix. In fact, we have done this twice in class! In the first, we used billing zip codes. In the second, we used “primary specialty” of physicians. Perhaps, this is intuitively similar to vertex contractions in graph theory. However, I suspect that the motivations are vastly different. By “collapsing” on known blocks (e.g. zip code or specialty), we vastly simplify the network, allowing us to highlight salient features (e.g. geography or macro hospital referral patterns).
Sometimes, the modular network is sufficiently small that visualizations become tractable.
rm(list = ls())
library(Matrix)
library(igraph)
load(url("http://pages.stat.wisc.edu/~karlrohe/netsci/data/specialty.RData"))
g = graph.adjacency(A, mode = "directed", weighted = T)
plot(g, vertex.label = NA)
core = graph.coreness(g)
sum(E(g)$weight>3)
## [1] 286
# E(g)$weight = as.numeric(E(g)$weight>3)
g = delete.edges(g, which(E(g)$weight <3))
plot(g, vertex.label = NA)
plot(induced.subgraph(g,vids = core>30),vertex.label=NA)
What a hairball! Clearly, we need a more refined analysis (or better plotting software).
Note: in this graph \(\hat B\) is a type of weighted adjacency matrix. Because \(A\) is asymmetric, so is \(\hat B\). So, this modular network is directed.
We have already modularized the network, but we can modularize it again! This time, we can try to estimate a partition \(z\) of specialty types.
Recall that with a directed network, we can use the standard spectral clustering algorithm applied to \(A + A^T\).
# this function presumes that A is small and dense. As such, it computes the full svd.
spectralClustering = function(A, k = c()){
D = Diagonal(n = nrow(A), 1/sqrt(degree(g) + mean(degree(g))))
L = D%*%A%*%D
if(length(k) ==1) X = svd(L)$u[,1:k]
if(length(k)==0){ # this will not work in markdown, but you can try.
s = svd(L)
plot(s$d[-1])
print("type desired k (remember to add one)")
k = scan()
X = s$u[,1:k]
}
X = t(apply(X, 1, function(x) return(x/sqrt(sum(x^2)))))
return(kmeans(X, k, nstart = 20))
}
As = A + t(A)
spec = spectralClustering(As, 8)
table(spec$clust) # a bit unbalanced!
##
## 1 2 3 4 5 6 7 8
## 10 3 9 4 7 9 7 25
Note: What is a good way to choose \(k\)? Look at the scree plot! From wikipedia, “Compute the eigenvalues for the correlation matrix and plot the values from largest to smallest. Examine the graph to determine the last substantial drop in the magnitude of eigenvalues. The number of plotted points before the last drop is the number of factors to include in the model. This method has been criticized because of its subjective nature (i.e., there is no clear objective definition of what constitutes a substantial drop).[17] As this procedure is subjective, Courtney (2013) does not recommend it.” Here is the scree plot that provided \(k=8\).
D = Diagonal(n = nrow(A), 1/sqrt(degree(g) + mean(degree(g))))
s= svd(D%*%A%*%D)
plot(s$d)
plot(s$d[-1])
So, now we found a partition into \(k=8\) sets. We can interpret the resulting partition with a \(k = 8\) node graph! To do this, we need to estimate \(B\) for the given \(z\). For a given \(z\), estimating \(B\) is straightforward, \[\hat B_{uv} = sum_{i,j: z(i) = u, z(j) = v}A_{ij}.\] This can be done quickly with some linear algebra:
fitSBM = function(A, z, y = c()){
# takes label vector z (and y for co-blockmodel) as input! Does not estimate partition.
# computes the empirical probability of an edge from block u to block v
Zn = scale(model.matrix(~as.factor(z)-1),center=F, scale = T)/sqrt(length(z)-1)
if(length(y)==0) return(t(Zn)%*%A%*%Zn)
if(length(y)>0){
Yn = scale(model.matrix(~as.factor(y)-1),center=F, scale = T)/sqrt(length(y)-1)
return(t(Yn)%*%A%*%Zn)
}
}
Bhat = as.matrix(fitSBM(A, spec$clust)) # convert to matrix from a Matrix.
round(Bhat,1)
## as.factor(z)1 as.factor(z)2 as.factor(z)3 as.factor(z)4
## as.factor(z)1 19.9 0.0 3.9 2.8
## as.factor(z)2 0.0 3.0 0.2 0.0
## as.factor(z)3 3.7 0.2 8.6 1.3
## as.factor(z)4 1.8 0.0 1.1 7.9
## as.factor(z)5 26.9 1.2 15.8 9.2
## as.factor(z)6 3.6 0.9 2.2 1.1
## as.factor(z)7 4.6 0.0 3.4 1.0
## as.factor(z)8 1.0 0.1 0.4 0.3
## as.factor(z)5 as.factor(z)6 as.factor(z)7 as.factor(z)8
## as.factor(z)1 26.5 4.8 6.5 1.0
## as.factor(z)2 0.5 0.3 0.3 0.0
## as.factor(z)3 15.3 2.1 3.2 0.8
## as.factor(z)4 8.7 0.4 1.3 0.2
## as.factor(z)5 29.5 12.6 11.9 9.5
## as.factor(z)6 12.2 6.2 2.2 0.5
## as.factor(z)7 10.8 2.0 7.6 0.4
## as.factor(z)8 9.1 0.5 0.9 0.6
Huh?! How come it has values greater than one?? Remember that the specialty graph \(A\) is weighted. So, “\(B\)”" contains the “expected weights”. As such, this can contain values greater than one! If we were writing down likelihoods and fitting MLE’s, we would have to be much more careful about all of this. A great advantage of spectral techniques is that you can blindly apply them; of course, a downside is that you might forget and the code won’t throw an error.
That table is difficult to read. Perhaps we could display as an image…
image(Bhat, col = grey(seq(1,0,len=100)))
Because of some stupid jpeg convention, R draws matrices not how we envision them! Be careful!!!
imageMat = function(M){
n = nrow(M)
return(t(M[n:1,]))
}
image(imageMat(Bhat), col = grey(seq(1,0,len=100)))
image(imageMat((Bhat - t(Bhat))^2), col = grey(seq(1,0,len=100))) # inspect the "reciprocity"
Even with 8 node graphs, it is really difficult to simply visualize the adjacency matrix!
gs = graph.adjacency(Bhat,mode = "directed", weighted=T)
loc = layout.fruchterman.reingold(gs)
plot(gs, layout= loc, edge.width=(E(gs)$weight), vertex.label = 1:8, edge.curved=T)
plot(gs, layout= loc, edge.width=sqrt(E(gs)$weight), vertex.label = table(spec$clust), edge.curved=T)
This is clearly a dense graph. It is a bit difficult to interpret. Best to focus on a node and move out from there. The second block is highly connected! Guess which specialties are in this block!
nm = colnames(A)
g = graph.adjacency(A, mode = "directed", weighted = T)
clustlist = list()
for(j in 1:8) clustlist[[j]]= c(
round(mean(degree(g,v = spec$clust == j,mode = "in"))),
round(mean(degree(g,v = spec$clust == j,mode = "out"))),
nm[spec$clust == j])
clustlist
## [[1]]
## [1] "33" "33"
## [3] "EMERGENCY MEDICINE" "VASCULAR SURGERY"
## [5] "NEUROLOGY" "NEPHROLOGY"
## [7] "INTERVENTIONAL RADIOLOGY" "INFECTIOUS DISEASE"
## [9] "INTERVENTIONAL CARDIOLOGY" "CARDIAC ELECTROPHYSIOLOGY"
## [11] "GENERAL SURGERY" "CRITICAL CARE (INTENSIVISTS)"
##
## [[2]]
## [1] "8" "8"
## [3] "PHYSICAL THERAPY" "OCCUPATIONAL THERAPY"
## [5] "SPEECH LANGUAGE PATHOLOGIST"
##
## [[3]]
## [1] "22"
## [2] "23"
## [3] "OPHTHALMOLOGY"
## [4] "PODIATRY"
## [5] "UROLOGY"
## [6] "DERMATOLOGY"
## [7] "OPTOMETRY"
## [8] "ENDOCRINOLOGY"
## [9] "GENERAL PRACTICE"
## [10] "CHIROPRACTIC"
## [11] "REGISTERED DIETITIAN OR NUTRITION PROFESSIONAL"
##
## [[4]]
## [1] "20" "20" "HEMATOLOGY/ONCOLOGY"
## [4] "RADIATION ONCOLOGY" "MEDICAL ONCOLOGY" "HEMATOLOGY"
##
## [[5]]
## [1] "48"
## [2] "48"
## [3] "INTERNAL MEDICINE"
## [4] "CARDIOVASCULAR DISEASE (CARDIOLOGY)"
## [5] "DIAGNOSTIC RADIOLOGY"
## [6] "PATHOLOGY"
## [7] "FAMILY PRACTICE"
## [8] "PULMONARY DISEASE"
## [9] "AUDIOLOGIST"
##
## [[6]]
## [1] "21"
## [2] "19"
## [3] "GASTROENTEROLOGY"
## [4] "CERTIFIED REGISTERED NURSE ANESTHETIST"
## [5] "ORTHOPEDIC SURGERY"
## [6] "ANESTHESIOLOGY"
## [7] "HAND SURGERY"
## [8] "INTERVENTIONAL PAIN MANAGEMENT"
## [9] "ANESTHESIOLOGY ASSISTANT"
## [10] "PAIN MANAGEMENT"
## [11] "PREVENTATIVE MEDICINE"
##
## [[7]]
## [1] "24"
## [2] "22"
## [3] "NURSE PRACTITIONER"
## [4] "PHYSICAL MEDICINE AND REHABILITATION"
## [5] "PSYCHIATRY"
## [6] "GERIATRIC MEDICINE"
## [7] "CLINICAL PSYCHOLOGIST"
## [8] "ADDICTION MEDICINE"
## [9] "CLINICAL SOCIAL WORKER"
##
## [[8]]
## [1] "10"
## [2] "10"
## [3] "CARDIAC SURGERY"
## [4] "THORACIC SURGERY"
## [5] "RHEUMATOLOGY"
## [6] "ALLERGY/IMMUNOLOGY"
## [7] "OSTEOPATHIC MANIPULATIVE MEDICINE"
## [8] "NEUROSURGERY"
## [9] "HOSPICE/PALLIATIVE CARE"
## [10] "NUCLEAR MEDICINE"
## [11] "OTOLARYNGOLOGY"
## [12] "UNDEFINED PHYSICIAN TYPE (SPECIFY)"
## [13] "PERIPHERAL VASCULAR DISEASE"
## [14] "GYNECOLOGICAL ONCOLOGY"
## [15] "SPORTS MEDICINE"
## [16] "SLEEP LABORATORY/MEDICINE"
## [17] "CLINICAL NURSE SPECIALIST"
## [18] "PLASTIC AND RECONSTRUCTIVE SURGERY"
## [19] "OBSTETRICS/GYNECOLOGY"
## [20] "COLORECTAL SURGERY (PROCTOLOGY)"
## [21] "NEUROPSYCHIATRY"
## [22] "PHYSICIAN ASSISTANT"
## [23] "PEDIATRIC MEDICINE"
## [24] "SURGICAL ONCOLOGY"
## [25] "GERIATRIC PSYCHIATRY"
## [26] "CERTIFIED NURSE MIDWIFE"
## [27] "ORAL SURGERY (DENTIST ONLY)"
Before reading the next steps, you might read the lecture notes on the eigendecomposition and the svd.
The algorithm “di-sim” explored here is described in this paper.
A key idea in the di-sim algorithm is that there are two partitions of the nodes. One for sending and one for recieving. Each node is assigned to one sending-block and one receiving-block. this is encapsulated in the Stochastic Co-Blockmodel, where there are two partitions \(Y\) and \(Z\) and \(E(A) = YBZ^T\). Before, the SBM provides a way of computing super nodes in a “collapsed” network. The ScBM summarizes a directed graph with a bipartite graph ([click here for more about bipartite](. Each node is assigned to a “sending” super node and a “receiving” super node. Let’s illustrate this with the specialty graph.
rs = rowSums(A); cs = colSums(A)
Drow = Diagonal(n = nrow(A), 1/sqrt(rs + mean(rs)))
Dcol = Diagonal(n = ncol(A), 1/sqrt(cs + mean(cs)))
L = Drow%*%A%*%Dcol
s = svd(L)
plot(s$d[-1])
k = 8
u = s$u[,1:k]; v = s$v[,1:k]
n = nrow(A)
un = t(apply(u,1, function(x) return(x/sqrt(sum(x^2)+1/n))))
vn = t(apply(v,1, function(x) return(x/sqrt(sum(x^2)+1/n))))
km = kmeans(rbind(un,vn), centers = k,nstart = 100)
uclust = km$clust[1:n]
vclust = km$clust[-(1:n)]
table(uclust,vclust)
## vclust
## uclust 1 2 3 4 5 6 7 8
## 1 9 0 0 0 0 0 0 0
## 2 0 4 0 1 0 0 0 0
## 3 0 0 0 0 0 4 0 0
## 4 0 0 5 10 0 0 0 0
## 5 0 0 0 0 4 0 0 0
## 6 1 1 18 2 0 0 0 2
## 7 1 0 1 0 0 0 5 0
## 8 0 1 0 0 0 0 0 5
If “sending stochastic equivalence” implied “receiving stochastic equivalence”, then this matrix would have a strong diagonal. Let’s visualize it with a bi-partite graph.
Bhat = fitSBM(A, y = uclust, z = vclust)
Ab = diag(rep(0, 2*k))
for(i in 1:k) for(j in 1:k) Ab[i,k+j] = Bhat[i,j]
Ab = Ab + t(Ab)
gb = graph.adjacency(Ab, weighted = T)
V(gb)$type = c(rep(T,8), rep(F,8))
V(gb)$name = c(paste("s",1:8, sep=""),paste("r",1:8, sep=""))
plot(gb, layout= layout.bipartite, edge.width=(E(gb)$weight)^.7, edge.arrow.size = 0)
What are the specialties in these blocks?
nm = colnames(A)
sendlist = list()
reclist = list()
for(j in 1:8){
sendlist[[j]] = nm[uclust == j]
reclist[[j]] = nm[vclust == j]
}
sendlist
## [[1]]
## [1] "EMERGENCY MEDICINE" "VASCULAR SURGERY"
## [3] "NEUROLOGY" "NEPHROLOGY"
## [5] "INTERVENTIONAL RADIOLOGY" "PULMONARY DISEASE"
## [7] "INFECTIOUS DISEASE" "INTERVENTIONAL CARDIOLOGY"
## [9] "CRITICAL CARE (INTENSIVISTS)"
##
## [[2]]
## [1] "NURSE PRACTITIONER"
## [2] "PHYSICAL MEDICINE AND REHABILITATION"
## [3] "PSYCHIATRY"
## [4] "CLINICAL PSYCHOLOGIST"
## [5] "CLINICAL SOCIAL WORKER"
##
## [[3]]
## [1] "INTERNAL MEDICINE"
## [2] "CARDIOVASCULAR DISEASE (CARDIOLOGY)"
## [3] "DIAGNOSTIC RADIOLOGY"
## [4] "AUDIOLOGIST"
##
## [[4]]
## [1] "ORTHOPEDIC SURGERY"
## [2] "PHYSICAL THERAPY"
## [3] "PERIPHERAL VASCULAR DISEASE"
## [4] "GYNECOLOGICAL ONCOLOGY"
## [5] "SPORTS MEDICINE"
## [6] "SLEEP LABORATORY/MEDICINE"
## [7] "ADDICTION MEDICINE"
## [8] "OCCUPATIONAL THERAPY"
## [9] "NEUROPSYCHIATRY"
## [10] "REGISTERED DIETITIAN OR NUTRITION PROFESSIONAL"
## [11] "GERIATRIC PSYCHIATRY"
## [12] "SPEECH LANGUAGE PATHOLOGIST"
## [13] "CERTIFIED NURSE MIDWIFE"
## [14] "PREVENTATIVE MEDICINE"
## [15] "ORAL SURGERY (DENTIST ONLY)"
##
## [[5]]
## [1] "HEMATOLOGY/ONCOLOGY" "RADIATION ONCOLOGY" "MEDICAL ONCOLOGY"
## [4] "HEMATOLOGY"
##
## [[6]]
## [1] "CARDIAC SURGERY"
## [2] "ENDOCRINOLOGY"
## [3] "THORACIC SURGERY"
## [4] "RHEUMATOLOGY"
## [5] "ALLERGY/IMMUNOLOGY"
## [6] "OSTEOPATHIC MANIPULATIVE MEDICINE"
## [7] "GENERAL PRACTICE"
## [8] "NEUROSURGERY"
## [9] "HOSPICE/PALLIATIVE CARE"
## [10] "CHIROPRACTIC"
## [11] "NUCLEAR MEDICINE"
## [12] "OTOLARYNGOLOGY"
## [13] "UNDEFINED PHYSICIAN TYPE (SPECIFY)"
## [14] "HAND SURGERY"
## [15] "GERIATRIC MEDICINE"
## [16] "CLINICAL NURSE SPECIALIST"
## [17] "PLASTIC AND RECONSTRUCTIVE SURGERY"
## [18] "OBSTETRICS/GYNECOLOGY"
## [19] "INTERVENTIONAL PAIN MANAGEMENT"
## [20] "COLORECTAL SURGERY (PROCTOLOGY)"
## [21] "PAIN MANAGEMENT"
## [22] "PHYSICIAN ASSISTANT"
## [23] "PEDIATRIC MEDICINE"
## [24] "SURGICAL ONCOLOGY"
##
## [[7]]
## [1] "PATHOLOGY"
## [2] "GASTROENTEROLOGY"
## [3] "CERTIFIED REGISTERED NURSE ANESTHETIST"
## [4] "CARDIAC ELECTROPHYSIOLOGY"
## [5] "ANESTHESIOLOGY"
## [6] "GENERAL SURGERY"
## [7] "ANESTHESIOLOGY ASSISTANT"
##
## [[8]]
## [1] "OPHTHALMOLOGY" "PODIATRY" "FAMILY PRACTICE" "UROLOGY"
## [5] "DERMATOLOGY" "OPTOMETRY"
reclist
## [[1]]
## [1] "EMERGENCY MEDICINE" "VASCULAR SURGERY"
## [3] "NEUROLOGY" "NEPHROLOGY"
## [5] "INTERVENTIONAL RADIOLOGY" "PULMONARY DISEASE"
## [7] "INFECTIOUS DISEASE" "INTERVENTIONAL CARDIOLOGY"
## [9] "CARDIAC ELECTROPHYSIOLOGY" "ENDOCRINOLOGY"
## [11] "CRITICAL CARE (INTENSIVISTS)"
##
## [[2]]
## [1] "PODIATRY" "NURSE PRACTITIONER"
## [3] "PSYCHIATRY" "GERIATRIC MEDICINE"
## [5] "CLINICAL PSYCHOLOGIST" "CLINICAL SOCIAL WORKER"
##
## [[3]]
## [1] "CARDIAC SURGERY"
## [2] "ORTHOPEDIC SURGERY"
## [3] "THORACIC SURGERY"
## [4] "RHEUMATOLOGY"
## [5] "ALLERGY/IMMUNOLOGY"
## [6] "OSTEOPATHIC MANIPULATIVE MEDICINE"
## [7] "NEUROSURGERY"
## [8] "HOSPICE/PALLIATIVE CARE"
## [9] "CHIROPRACTIC"
## [10] "NUCLEAR MEDICINE"
## [11] "OTOLARYNGOLOGY"
## [12] "UNDEFINED PHYSICIAN TYPE (SPECIFY)"
## [13] "GYNECOLOGICAL ONCOLOGY"
## [14] "SPORTS MEDICINE"
## [15] "SLEEP LABORATORY/MEDICINE"
## [16] "CLINICAL NURSE SPECIALIST"
## [17] "OBSTETRICS/GYNECOLOGY"
## [18] "INTERVENTIONAL PAIN MANAGEMENT"
## [19] "COLORECTAL SURGERY (PROCTOLOGY)"
## [20] "ANESTHESIOLOGY ASSISTANT"
## [21] "PAIN MANAGEMENT"
## [22] "PHYSICIAN ASSISTANT"
## [23] "SURGICAL ONCOLOGY"
## [24] "CERTIFIED NURSE MIDWIFE"
##
## [[4]]
## [1] "PHYSICAL MEDICINE AND REHABILITATION"
## [2] "PHYSICAL THERAPY"
## [3] "HAND SURGERY"
## [4] "PERIPHERAL VASCULAR DISEASE"
## [5] "ADDICTION MEDICINE"
## [6] "OCCUPATIONAL THERAPY"
## [7] "NEUROPSYCHIATRY"
## [8] "PEDIATRIC MEDICINE"
## [9] "REGISTERED DIETITIAN OR NUTRITION PROFESSIONAL"
## [10] "GERIATRIC PSYCHIATRY"
## [11] "SPEECH LANGUAGE PATHOLOGIST"
## [12] "PREVENTATIVE MEDICINE"
## [13] "ORAL SURGERY (DENTIST ONLY)"
##
## [[5]]
## [1] "HEMATOLOGY/ONCOLOGY" "RADIATION ONCOLOGY" "MEDICAL ONCOLOGY"
## [4] "HEMATOLOGY"
##
## [[6]]
## [1] "INTERNAL MEDICINE"
## [2] "CARDIOVASCULAR DISEASE (CARDIOLOGY)"
## [3] "DIAGNOSTIC RADIOLOGY"
## [4] "AUDIOLOGIST"
##
## [[7]]
## [1] "PATHOLOGY"
## [2] "GASTROENTEROLOGY"
## [3] "CERTIFIED REGISTERED NURSE ANESTHETIST"
## [4] "ANESTHESIOLOGY"
## [5] "GENERAL SURGERY"
##
## [[8]]
## [1] "OPHTHALMOLOGY"
## [2] "FAMILY PRACTICE"
## [3] "UROLOGY"
## [4] "DERMATOLOGY"
## [5] "OPTOMETRY"
## [6] "GENERAL PRACTICE"
## [7] "PLASTIC AND RECONSTRUCTIVE SURGERY"