Stat 992: Theory and Methods of Network Analysis

UW-Madison

Karl Rohe

email: first name and last name at stat dott wisc dott edu
office: 1239 MSC
Syllabus

Helpful links

Refresh material: An Introduction to R
Advanced R by Hadley Wickham
data.table is an R library that really speeds up computations on large data sets. Here is a cheat sheet.
Making beautiful documents with R markdown and knitR. Here are two cheat sheets.
Your work is useless if it is not conveyed effectively. Please consider visiting the campus writing center for any/all project(s).

Data

Physician referral network(s).
Information on physicians. Click on links on left.
More information on physicians (~200MB zipped, ~600MB expanded). data dictionary and homepage
Some NPI’s are individuals. Others are organizations! This file allows you to distinguish individuals from organizations.
A list of data sources.
others, more, etc.

Optional text:

Statistical Analysis of Network Data with R. By Kolaczyk and Csardi. Springer, 2014.

Week 1

In class

A starting point: MD referral networks.
Discussion: What year and department? What types of “network analysis things” do people do in your field?
Discussion: What statistics(ish) classes have you taken?
Discussion: Why are you in this class?
What is network analysis?
Why is Karl teaching this class?
Syllabus
Introductions to course aims.
Discussion: What type of data are you interested in studying? Topic of data source? Structure of data?
Authorship issues when discussing current research.
Basic notation. Graphs, nodes, edges, direction, degree, node induced subgraph, k-core
Check out the data. Inspect Madison.

At home

Make sure your version of R is up to date. The reason that I was not able to run the code on my laptop (that was running on my desktop) is that my version of R was just a little too old. fread in data.table was not running properly.
I will assume you know how to multiply matrices. Here is some practice.
I will assume you know what the eigendecomposition is. Here is what wikipedia says.

Week 2

Recap

What is network science? Maybe: We study the structure of the network in order to understand its functions and the processes that happen on the network.
We are studying an MD referral network (actually co-occurrence).
This is a type of data playground. Pedagogical rant for future instructors.
Basic notation. Graphs, nodes, edges, direction, degree, node induced subgraph, k-core.

In class

One aim of the course is to push you to be more adventurous with trying new libraries; googling things to find answers, code snippets, etc. The best way to quickly write code is to find someone else’s! R is a platform.
Continue inspecting madison.
Play with the zipcodes
Basic notation. connected component, clusters, etc.
ACS intro
Link edge lengths to ACS
Eigenvectors, eigendecomposition, spectral clustering V0

At home

Make sure your version of R is up to date.
Here is a list of the common graph notation. We will use things up to and including the section on subgraphs. Also, see graph properties on the second page.
Open the code from class (and this one and this one) and get it running on your computer. Remember, library(data.table) only works after installing with install.packages(“data.table”). Same for igraph and other libraries.
Look over the playground of possible data sets.
Exercise: Pick a hospital and pick a physician characteristic. If the characteristic is not self explanatory, describe the characteristic in words; you might have to do a google search to understand it a bit more. Then, load the data and describe how the characteristic relates to the structure of the hospital’s referral network. What do you think explains this relationship? Make an Rmarkdown document and describe your analysis to a friend.
Get a key to download data from the American Community Survey.
I made a Piazza course. Sorry for the delay in those who have asked… I didn’t know what you were talking about! I have sent an email invite for those enrolled. Others are more than welcome to join. The class is entered as “Stat 992”. If you need an invite, please send me an email.

Week 3

Announcements

Forrest Crawford will be giving a guest lecture on some open problems in Thursday’s class. He will give the department seminar on Wednesday and we will have a roundtable discussion. If any student or faculty has a particular interest in snowball or RDS sampling methods, we will have a round table discussion with Forrest at 1:30 in room 1217c MSC. Of course, MSC is a nightmare to navigate! 1217c is easy to find; it is just across the hall from Karl’s office (click link for fabulous, 11 second, student-made navigation video!). Please feel free to circulate any of the above announcements: Thursday class, Wednesday seminar, Wednesday roundtable.
We have not talked about dynamic graphs, but one interesting feature of our medicare referral data is that (i) we have multiple years of data and (ii) some years are before the implementation of the affordable care act (aka Obamacare) and some years are after the implementation. This would be an interesting angle for a class project.

In class

Intro to network driven sampling (i.e. webcrawling and snowball sampling).

Intro to Markov chains. Lecture notes
Reference text: Levin, Peres, Wilmer. Markov chains and mixing times.
unbiased estimation with Horvitz-Thompson / Inverse probability weighting / Volz-Heckathorn. Paper on computing the variance. Slides.

Thursday: discussion with guest Forrest Crawford.

Week 4

Recap

ACS intro
Link edge lengths to ACS
Have you opened the data from class? Have you explored any of the other data sources that we have not yet talked about?
Click on the links!!! I put them in our code documents and on this page so that you click on them. They contain fun and important stuff.

In class

Recall that in the Madison graph, it was tricky to label the nodes correctly. Since it was an exploratory example (i.e. not for formal publication), I didn’t take the time to make these plots. What I wanted to do then, but didn’t fit into class flow, was brush! Brushing! It needs Java. Ugh. install.packages(“JGR”); library(JGR); JGR(). From here, install.packages(“iplots”).

Brushing (and dynamic graphics more generally) are amazing exploratory tools. Use these tools when you have multidimensional data that you don’t (yet) understand.
On new OSX machines, you might need to install a Java update. You will get an error to tell you this.

Recall discussion of eigenvectors, eigendecomposition, spectral clustering V0. Note: SVD provides the best rank k approximation and SVD=eigen on a real symmetric matrix. Spectral Clustering
Merging two data sources for “continuous” clusters with CASC (click for class code). Click here for the original paper.

At home

Next week, we will start exploring edge directions. There are two basic algorithms that will reveal information that is both “global” and intimately related to edge direction.

pageRank for ranking. Original source. That algorithm made its creators more than 60 billion dollars; they had to share the $ with each other :(
Co-clustering for directed graphs with Di-Sim. The second link applies co-cluster to directed graphs. The first paper is the original paper on co-clustering, which was not motivated by directed graphs, but is a super fun read.
Data that Eva mentioned: surgeon score card, consumers checkbook rank MD’s, Dartmouth Atlas has lots of interesting things.
Eva’s slides

Week 5

Recap

When you have multivariate data that you don’t really understand, use brushing for dynamic visualization.
Eigenvectors of $A$ and the regularized $L_{sym}$ reveal partitions in the network. If you want to find $k > 2$ clusters, find the $k$ leading eigenvectors and make an $N \times k$ matrix. Normalize the rows and run k-means. This is spectral clustering.
If you have disparate sources of network data, you can add the adjacency matrices together and then run spectral clustering! I call this CASC, but I am not the first person to do it! Superstar grad student Yilin Zhang is now trying to “make it better”. Perhaps later in the semester she can describe her extensions.]
Question after class: “Why does spectral clustering work?!” There are lots of reasons; none are fully satisfying.

In class

Finish CASC.
Explore the referral network with hierarchical clustering. The approach used here is a combination of CASC and minimum spanning trees! Minimum spanning trees are amazing. You can read about the clustering algorithm here. (Actually, the clustering algorithm actually uses a “maximum spanning tree”.)
Relationship between normalized cuts (i.e. ncut) and eigenvectors of $L_{sym}$ on slide 46
Why does spectral clustering work?
Notes on the eigendecomposition, the singular value decomposition, computation, etc.
Study the asymmetry in the “specialty” graph.

At home

Find one “external” source of data that interests you. “External” is something that has not been analyzed in class, e.g. ACS data or the physician characteristic file. Your external data could link to NPI, hospital, or something else. Link this data to the referral network. Make (roughly) three summary figures to illustrate how it relates. These could be maps, scatter plots, whatever . . . as long as they relate your data to the data we have analyzed in class. Write an abstract/summary in Markdown, maximum 300 words. Include figures with captions. Due on Thursday October 8. If you are having trouble getting started, here is a list of possible data sources. Here are some others that Eva discussed: surgeon score card, consumers checkbook rank MD’s, Dartmouth Atlas. Finally, here is another source.
Here is some cool data. You want to click on the tab delimited link. If you want to download the data, you will have to accept a legal agreement. So long as you only use this for a class project, it is my understanding that this is fully within the allowances of the agreement.

Week 6

Recap

CASC allows you to incorporate heterogeneous sources of information.
Hierarchical clustering allows you to find “nested partitions”. It can be computed quickly with minimum spanning trees. This algorithm gives a sense of the vast disparity in the flavor of clustering algorithms.
Algorithmic complexity, $O(n^2)$ vs $O(|E|)$.
Why does spectral clustering work (three reasons).
Everyone has data by now… right?!?!?! Turn it in on learn at UW.
I prefer homework turned in in .html via learn@uw. However, .pdf is ok and email is also ok.

In class

Finish notes on the eigendecomposition, the singular value decomposition, computation, etc.
The Stochastic Co-Blockmodel for modeling asymmetry.
Finish study the asymmetry in the “specialty” graph.
Illustrating Stochastic Blockmodels with the specialty graph.
Anonymous feedback.

Week 7

Recap/announcements

I will give the “applied algebra” seminar this Friday at 10am in 901 Van Vleck. I will talk about clustering graphs and making it practical. It will review some things from class in a “seminar” format. See the abstract here.
No class Tuesday Oct 20.
The Stochastic Blockmodel can be “fit” with an observed $z$. The resulting estimate of “$B$” can be interpreted as a new network; we did this for zip codes and specialty type. If the old network is undirected, then so is $B$. If it was directed, then so is $B$. Either way, the SBM “approximates” the old network with a unipartite network. However, if the underlying network is directed, then we can alternatively fit the Stochastic Co-Blockmodel. This approximates the directed network with a bipartite network. Each node in the bipartite graph is a “sending” node or a “receiving” node. Each node in the original network is assigned to exactly one sending node and exactly one receiving node. When you are thinking of your projects, you will likely want to “simplify” the massive network in some way. One way is to focus on some geographic region (e.g. Madison). Another way is to fit an SBM with some observed $z$.
Project teams
Project description
Feedback.

In class

advice in plots:

use log transform in histograms. You can actually make a theory out of that!
Colors are super important. Since we now publish on the web, use colors liberally! Your color palette reveals how you think about your variables. If the variable is continuous, use something continuous like col = grey(seq(1,0,len=100)). There are, of course, much better options. The JHU faculty who write the simply statistics blog have this to suggest.

As we form teams, engage with your own curiosity! Ask questions! Do not strive for a question or an answer. Accept that these will come with effort. Engage with the data. Search for papers in the literature. Throughout this task, maintain a spirit of inquiry and acknowledge that the path is uncertain. I think this is good advice for research in general.
If you take a subgraph and it becomes super sparse, then something has gone wrong!!! It is their connections that are interesting! When it is so disconnected, there is very little “global structure” to understand! The graph can be easily summarized (e.g. as X dyads, four 2-stars, and one 11-star); we don’t need any algorithms!
Empirical regularities: sparsity, reciprocity, degree, triangles (transitivity ratio vs clustering coefficient), communities at various resolutions, small diameter, motifs.
Low rank matrix estimation.
Facts about Erdos-Renyi random graphs

At home

At the very least, read the abstracts and introductions of the six papers in this zip folder.

Week 8

Recap/announcements

Use log transform in histograms. Colors are super important (continuous vs categorical)
Empirical regularities: sparsity, reciprocity, degree, triangles (transitivity ratio vs clustering coefficient), communities at various resolutions, small diameter, motifs.
Email a “pre-proposal” to me by Thursday October 29. This should be a couple paragraphs. No pictures. Just an email and cc all group members. The email subject line should be 992 preproposal.

In class

No class on Tuesday.
Continue facts about Erdos-Renyi random graphs
Low rank matrix estimation.

At home

Read the six papers in this zip folder.
Email a “pre-proposal” to me by Thursday October 29. This should be a couple paragraphs. No pictures. Just an email and cc all group members. The email subject line should be 992 preproposal.

Week 9

Recap/announcements

Facts about Erdos-Renyi random graphs. Added simulations to illustrate the points.
Low rank matrix estimation.
In your projects, you can compare different regions! How does the structure and function in one region compare to another region? You could use HRR’s to provide the regions. Here, each region is a unit of analysis.
One thing that is super interesting is that the payment data gives # of procedures. Is this a better measure of network degree??? How could this be used in interesting ways?
preproposals

In class

Group projects should give you the perspective of an “applied network researcher”. How do statistical models help you?
Empirical graphs have short pairwise distances. Play this game to try and find some short paths.
Sampling graphs with given degree sequence.
Preferential Attachment Model.
The exponential random graph model

At home

Read the six papers in this zip folder.
Email a “pre-proposal” to me by Thursday October 29. This should be a couple paragraphs. No pictures. Just an email and cc all group members. The email subject line should be 992 preproposal.

Week 10

Recap/announcements

Preferential Attachment Model.
The exponential random graph model
Explore the dynamic network!
Some recently discovered problems with our data!
You are going to struggle to understand the project and its expectations if you do not come to class. Asking me to grade your assignments, when it is very clear from your work that you have not been attending class, is offensive. If you plan on completing a project, please come to class.
Library FNN does not make exact knn graphs. Perhaps this explains some strange results from the CASC code. library(spatgraphs) contains the best code to construct a knn graph (from low dimensional data) that I could find.
Class presentation of proposals happens next Tuesday. I would be more than happy to look at your presentation materials before you present.

In class

“Latent models” span a broad class of representations that are highly related.
New code to plot zip code clusters (use if you want). See examples
Any class time remaining will be used for discussions within groups to prepare for our Tuesday guests.

At home

Others have tried to survey the current landscape of network models and have made much more extensive reports than my lecture notes. For example, see here.
Prepare for presentations!

Week 11

Recap/announcements

Recall the first sentence of the project description: “Your project should (1) ask a question that is substantively interesting, (2) answer this question (this is your thesis statement), (3) provide evidence to support your answer, and (4) discuss the limitations of your analysis.”

In class

Present project proposals in class to Raj Mukherjee MD; Sara Fernandes Taylor PhD; Professor Eva DuGoff PhD, MPP; Professor Kevin Look PharmD, PhD; and Professor Angie Ingraham MD, MS. (November 10)
Medical fraud detection using physician referrals and PartD prescription data
Part D Opioids and the Physician Referral Network
Hospital Complication and Readmission Scores in Relation to Hospital Referral Networks and Socioeconomic Deprivation in the United States
Predicting Average Medical Payment using Physician Referral Network at the Hospital Service Area Level
Analysis of Regional Mortality using Physician Network Characteristics
The effect of intra-hospital physician referral networks on readmission and mortality data
Why does long-distance referral happen?
The referral network within internal medicians and the network between internal medicians and other specialities (Bingying Xie et al)
Group 2
Group 2 proposal (Rong Ma et al)
Confidence intervals for graphs

At home

Week 12

Recap/announcements

If you need to download the data that was removed from the medicare website, you can download it here.

In class

No class on 11/17. We will have a makeup class on 11/20; Tyler McCormick will give a talk titled “Multiple Questions on Multiple Scales: Multiresolution Social Network Models” at 11:30 in room 1210 MSC.
On 11/19, Jason Fletcher from the La Follette School of Public Affairs will talk about his recent work.

Week 13

Recap/announcements

Did you want more referral data that was removed from the CMS website? Someone is still sharing it.
Schedule for the rest of the semester.
Marlon next Tuesday! Paper 1. Paper 2. He will be teaching a networks course next semester: syllabus.

In class

Finish group proposals.
Review progress on projects.
Notes on proposals
Comments from Raj.
No class on 11/26, Thanksgiving.

At home

Week 14

Recap/announcements

Group presentations on December 8 and 10!!!
A breakthrough in ergm-like models.

In class

Marlon Mundt, guest lecture.
Covariate assisted spectral clustering wth bag of words, from Yilin Zhang.

At home

Week 15

In class

The order of the group presentations are below. I have kept the links to the proposals because they clarify the names of the group members. If you want to project slides from my computer, please send me the files via email with the subject line 992asdf. If you have a ppt deck, then please please please convert to pdf before sending. Marlon’s slides got shifted around because of this.
Hospital Complication and Readmission Scores in Relation to Hospital Referral Networks and Socioeconomic Deprivation in the United States
Predicting Average Medical Payment using Physician Referral Network at the Hospital Service Area Level
Analysis of Regional Mortality using Physician Network Characteristics
The effect of intra-hospital physician referral networks on readmission and mortality data
Medical fraud detection using physician referrals and PartD prescription data
Part D Opioids and the Physician Referral Network
Why does long-distance referral happen?
The referral network within internal medicians and the network between internal medicians and other specialities (Bingying Xie et al)
Group 2
Group 2 proposal (Rong Ma et al)
Confidence intervals for graphs

At home

Finish group projects and submit via learn@uw by Thursday Dec 17.