Stat 992 Lecture Notes.

I plan to use this document to highlight the main points of lecture.

Introduction to the course.

  • Course projects are key.
  • The course has three parts.
    • tutorial/methodological
    • background / technical & project proposals
    • project presentations
  • schedule office hours (next week)
  • poll: tidyverse? dplyr?
  • the data I think we should use in our projects.
  • Do you have a machine where you could download 100GB of data, with another ~100GB free for working files?

Themes of data science with graphs

Themes of data science. Due to the internet, we can now rapidly share both data and software. This enables a rich web of dependencies in both, from which data science is emergent.

Data science with graphs: four key techniques…

Sampling: make sure you get the thing you are looking for! Think like a “scientist”

What is aggregation? Putting nodes/edges together… finding a smaller node set where each original node corresponds to one of the smaller nodes.

Factoring/clustering: what happens when you cluster/factor the journal-journal graph?

contextualization.

Introduction to graphs

The Twitter API is accessible in R via rtweet.

  • murmuration, bitcoin, others…
  • graph notation/definitions:
    • a graph is both a vertex/node set \(V = \{1, \dots, n\}\) indexing the people and an edge set \(E\), the set of vertex pairs that are friends. So, if \(i\) and \(j\) are friends, then \((i,j) \in E\) (and vice versa). We write the graph as \(G = (V,E)\), but this is just notation.
    • Edges can be undirected \((i,j) \in E\) implies \((j,i) \in E\). Friendships on Facebook are like this. Edges can also be directed where it is feasible for \((i,j) \in E\), without implying that \((j,i) \in E\). Following patterns on Twitter are like this.
    • I like to represent the graph as an adjacency matrix \(A\), where \(A_{ij} \in \{0,1\}\) indicates if \((i,j) \in E\). Q: when is the matrix \(A\) symmetric?
    • degree: a node’s degree is the number of friends it has (directed graphs have both in- and out-degree). Node degrees are the row (or column) sums of \(A\).
    • sub-graph: you can create a graph by taking a subset of nodes in \(V\) and a subset of the edges in \(E\). Obviously, if you don’t include node \(i\) in your subgraph, then you necessarily exclude all edges connected to \(i\) in \(E\). But if \(i\) is included in the subgraph, then you do not need to include all edges connected to \(i\)!
    • node-induced subgraph: in a node-induced subgraph, you take a subset of nodes and include all edges among those nodes.
    • k-core: the largest node-induced subgraph where every node has at least degree k in the node-induced subgraph. Note: it is not enough to simply remove nodes with degree less than k!
    • path: A path in a graph is a sequence of edges which join a sequence of vertices. e.g. \(\{(1,3), (3,10), (10,5)\}\) is a path from 1 to 5. \(\{(1,4), (5,7)\}\) is not a path because the first edge ends at node 4 and the second edge starts at node 5.
    • weighted graph: Sometimes, each edge in the graph has a weight. This weight is typically presumed to be non-negative. For example, \((i,j)\) might indicate that \(i\) sent \(j\) a message. The weight might be how many messages. In this case, the adjacency matrix elements \(A_{ij}\) can be set to the weight of the edge \((i,j)\). In that notation, no edge is equivalent to edge weight zero (seems reasonable).
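The definitions above can be made concrete in code. Below is a minimal numpy sketch (the course itself leans on R, so treat this as illustrative); the toy edge list is a made-up example. Note the k-core caveat from above: it is not enough to remove low-degree nodes once, because each removal can push a neighbor's degree below \(k\), so we peel iteratively.

```python
import numpy as np

# Hypothetical toy graph on 6 nodes (0-indexed), undirected.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (4, 5), (3, 5), (2, 4)]

n = 6
A = np.zeros((n, n), dtype=int)
for i, j in edges:
    A[i, j] = 1
    A[j, i] = 1  # undirected: A is symmetric

degrees = A.sum(axis=1)  # node degrees are the row sums of A

# Node-induced subgraph on a node subset S: keep all edges among S.
S = [2, 3, 4, 5]
A_S = A[np.ix_(S, S)]

def k_core(A, k):
    """k-core by iterative peeling: repeatedly drop nodes whose degree
    (within the surviving subgraph) is below k, until none remain."""
    keep = np.ones(A.shape[0], dtype=bool)
    while True:
        deg = A[keep][:, keep].sum(axis=1)
        low = deg < k
        if not low.any():
            break
        keep[np.where(keep)[0][low]] = False
    return np.where(keep)[0]
```

For this toy graph, every node has degree at least 2, so the 2-core is the whole graph, while the 3-core is empty: after peeling the degree-2 nodes, the surviving triangle-like piece drops below degree 3 and unravels.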

Empirical regularities (network science)

  • sparsity (graphs are big. most people are connected to only a small number of others.),
  • reciprocity (if you are my friend, then I am your friend),
  • heavy tailed degree distribution (e.g. rich get richer). Power law degree distributions? Or perhaps, just heavy tailed…
  • core-periphery (densely connected “core” of nodes and a “periphery” of nodes that is weakly connected to core),
  • transitivity/triangles (your friends are likely to be my friends) and other motifs,
  • homophily (friends are similar in lots of ways. “birds of a feather fly together”)
  • low diameter / low average path length. For any two nodes in the graph, there is a short path between them. e.g. “6 degrees of separation”.
  • “small world” is the combination of sparse, transitive, and low diameter.
  • communities at various resolutions.
  • Basic model: latent space model
  • conditional on everyone’s latent features \(z_i\), friendships are independent with \(P(i \sim j \mid z_i, z_j) = p_{ij} = f(z_i,z_j)\), e.g. \(f(z_i,z_j) = \langle z_i, z_j\rangle\).
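A latent space model is easy to simulate. Here is a hedged numpy sketch with the inner-product link \(f(z_i, z_j) = \langle z_i, z_j \rangle\); the latent positions are drawn from a made-up uniform range chosen so the inner products stay in \([0,1]\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 2

# Hypothetical latent positions z_i; range chosen so <z_i, z_j> is in [0, 1].
Z = rng.uniform(0.1, 0.7, size=(n, d))
P = Z @ Z.T  # p_ij = <z_i, z_j>

# Conditional on Z, edges are independent Bernoulli(p_ij).
U = rng.uniform(size=(n, n))
A = np.triu((U < P).astype(int), k=1)  # sample each pair once
A = A + A.T                            # symmetrize; no self-loops
```

Because edges are conditionally independent given \(Z\), all the dependence in the observed graph is carried by the latent features.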

Tiny facts:

  • \([A^q]_{ij}\) is the number of walks from \(i\) to \(j\) in \(q\) steps (walks may revisit vertices, unlike paths).
  • Theorem: Suppose a simple graph (unweighted, undirected, no loops). The sum of the eigenvalues of \(A\) is equal to zero. The sum of the eigenvalues, each squared, is equal to twice the number of edges. The sum of the eigenvalues, each cubed, is equal to six times the number of triangles. (These are \(\operatorname{tr}(A)\), \(\operatorname{tr}(A^2)\), and \(\operatorname{tr}(A^3)\), respectively.)
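These trace identities are quick to check numerically. A sketch on a made-up simple graph (a triangle plus one pendant edge):

```python
import numpy as np

# Hypothetical simple graph: triangle {0,1,2} plus a pendant edge (2,3).
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

lam = np.linalg.eigvalsh(A)  # A is symmetric, so eigenvalues are real

m = int(A.sum() // 2)                       # number of edges (here 4)
tri = int(round(np.trace(A @ A @ A) / 6))   # number of triangles (here 1)

# sum(lam)   = tr(A)   = 0
# sum(lam^2) = tr(A^2) = 2 * #edges
# sum(lam^3) = tr(A^3) = 6 * #triangles
```

Each closed walk of length 3 traverses a triangle, and each triangle yields 6 such walks (3 starting vertices times 2 directions), which is where the factor of 6 comes from.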

Illustration

VSP on realdonald tweets

What is VSP?

Here is the VSP paper

After illustrating bff on journal clusters with journal names, show bff on clusters of statistics papers… clustered by citation (or co-authorship) and contextualized by their abstracts via bff. See here. The source code for bff is here. A more standard approach to this problem is tf-idf.
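For contrast with bff, here is a minimal tf-idf sketch in numpy. The three "documents" (one per hypothetical cluster) and the words in them are made-up examples; real use would go through a text-processing library.

```python
import numpy as np

# Hypothetical toy corpus: one bag of words per cluster.
docs = [
    "network clustering spectral clustering",
    "regression inference bootstrap",
    "network sampling respondent driven sampling",
]

vocab = sorted({w for d in docs for w in d.split()})
tf = np.array([[d.split().count(w) for w in vocab] for d in docs], dtype=float)

# idf: log(#docs / #docs containing the word); down-weights common words.
df = (tf > 0).sum(axis=0)
idf = np.log(len(docs) / df)
tfidf = tf * idf

# Top-scoring word for each document: the term that best distinguishes it.
top = [vocab[i] for i in tfidf.argmax(axis=1)]
```

Here "network" appears in two documents, so its idf is lower and the repeated, cluster-specific words ("clustering", "sampling") win out.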

PageRank, Personalized PageRank, two-way analysis

Alex gave a great talk on using personalized PageRank (PPR) to sample Twitter via aPPR.
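A hedged sketch of personalized PageRank by power iteration (not the aPPR implementation itself): the walk teleports back to a single seed node with probability alpha, so the stationary vector concentrates mass near the seed.

```python
import numpy as np

def personalized_pagerank(A, seed, alpha=0.15, iters=200):
    """PPR via power iteration: p <- (1 - alpha) * p W + alpha * e_seed,
    where W is the row-normalized random-walk matrix of A."""
    n = A.shape[0]
    deg = A.sum(axis=1)
    deg[deg == 0] = 1              # guard isolated nodes
    W = A / deg[:, None]           # row-stochastic transition matrix
    e = np.zeros(n); e[seed] = 1.0
    p = e.copy()
    for _ in range(iters):
        p = (1 - alpha) * (p @ W) + alpha * e
    return p

# Toy 4-cycle; PPR mass concentrates at and around the seed node.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
p = personalized_pagerank(A, seed=0)
```

Since \(W\) is row-stochastic and the teleport vector sums to one, \(p\) stays a probability vector throughout the iteration.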

Discuss first assignment: blog posts.

Blog posts

Data

Download this journal citation graph by clicking on V and E: (V, E).

How to get data from SemanticScholar into R

How to PPR-sample Twitter

Why do things work?

Lecture notes on eigendecomposition and SVD (old notes)

Lecture notes on why spectral clustering works

A good definition of rank

Power method: computes the leading eigenvector.
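A minimal numpy sketch of the power method on a symmetric matrix; the 2x2 test matrix is a made-up example with known eigenvalues 3 and 1.

```python
import numpy as np

def power_method(A, iters=500, tol=1e-12):
    """Leading eigenpair of a symmetric matrix: repeatedly multiply a
    random vector by A and renormalize until it stops changing."""
    rng = np.random.default_rng(0)
    v = rng.normal(size=A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = A @ v
        w /= np.linalg.norm(w)
        if np.linalg.norm(w - v) < tol:
            v = w
            break
        v = w
    lam = v @ A @ v  # Rayleigh quotient estimate of the eigenvalue
    return lam, v

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
lam, v = power_method(A)  # leading eigenvalue is 3, eigenvector ~ (1,1)/sqrt(2)
```

Convergence is geometric at rate \(|\lambda_2/\lambda_1|\), so it is fast when the spectral gap is large; this is the workhorse behind large-scale eigenvector computations like PageRank.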

“spectral graph theory” 1

The basic tools and how to use them.

  1. Graph factoring/clustering (we will not study the technical bits of the cited papers)

  2. Sampling

    • sampling edges. sampling nodes.
    • targeted sampling and how two graduate students made themselves billionaires.
  3. Finding features

The underlying technical tools:

This is a list of things that I would like to cover, but there is insufficient time to cover it all. We will be doing these things concurrently with project proposals and group work on projects.

  • Power method.
  • Erdős–Rényi facts: scaling, largest component, locally tree-like, diameter.
  • Latent space models (Hoff, Raftery, Handcock): conditional edge independence, infinite node-exchangeability, graphons.
  • Concentration of \(A\).
  • Weyl, Davis–Kahan, and Cape.
  • Latent space models vs ERGMs.
  • Aldous–Hoover and graphons.
  • Cheeger cuts (and core-cut).
  • Representative sampling (RDS): two regimes; low-variance estimators.