Stat 992 Lecture Notes.
I plan to use this document to highlight the main points of lecture.
Introduction to the course.
- Course projects are key.
- The course has three parts.
  - tutorial/methodological
  - background / technical & project proposals
  - project presentations
- schedule office hours (next week)
- poll: tidyverse? dplyr?
- the data I think we should use in our projects.
- Do you have a machine where you can download 100GB and still have another ~100GB free for other files?
Themes of data science with graphs
Themes of data science. Due to the internet, we can now rapidly share both data and software. This enables a rich web of dependencies in both, from which data science is emergent.
Data science with graphs: four key techniques (sampling, aggregation, factoring/clustering, contextualization).
Sampling: make sure you get the thing you are looking for! Think like a “scientist”
What is aggregation? Putting nodes/edges together: finding a smaller node set where each original node maps to exactly one node in the smaller set.
Factoring/clustering: what happens when you cluster/factor the journal-journal graph?
contextualization.
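To make aggregation concrete, here is a minimal sketch. The function name `aggregate_edges`, the toy edge list, and the node-to-group map `group_of` are all made up for illustration; they are not course data. Edges are treated as ordered pairs, so for an undirected graph you would want to sort each group pair first.

```python
from collections import Counter

def aggregate_edges(edges, group_of):
    """Map each node to its group and count the edges between groups."""
    weights = Counter()
    for i, j in edges:
        # Each original node corresponds to exactly one aggregated node.
        weights[(group_of[i], group_of[j])] += 1
    return dict(weights)

# Hypothetical toy data: five nodes collapsed into two groups.
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (1, 3)]
group_of = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b"}

aggregated = aggregate_edges(edges, group_of)
print(aggregated)
```

The result is a small weighted graph on the groups, where the weight of a group pair is the number of original edges between them.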
Introduction to graphs
The Twitter API is accessible in R via rtweet.
- murmuration, bitcoin, others…
- graph notation/definitions:
- a graph consists of a vertex/node set \(V = \{1, \dots, n\}\) indexing the people and an edge set \(E\), the set of vertex pairs that are friends. So, if \(i\) and \(j\) are friends, then \((i,j) \in E\) (and vice versa). We write the graph as \(G = (V,E)\), but this is just notation.
- Edges can be undirected: \((i,j) \in E\) implies \((j,i) \in E\). Friendships on Facebook are like this. Edges can also be directed, where \((i,j) \in E\) does not imply \((j,i) \in E\). Following patterns on Twitter are like this.
- I like to represent the graph as an adjacency matrix \(A\), where \(A_{ij} \in \{0,1\}\) indicates if \((i,j) \in E\). Q: when is the matrix \(A\) symmetric?
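- degree: a node’s degree is the number of friends it has (directed graphs have both in- and out-degree). Node degrees are the row (or column) sums of \(A\). In a directed graph with \(A_{ij}\) indicating \((i,j) \in E\), row sums give out-degrees and column sums give in-degrees.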
- sub-graph: you can create a graph by taking a subset of nodes in \(V\) and a subset of the edges in \(E\). Obviously, if you don’t include node \(i\) in your subgraph, then you necessarily exclude all edges connected to \(i\) in \(E\). But if \(i\) is included in the subgraph, then you do not need to include all edges connected to \(i\)!
- node-induced subgraph: in a node-induced subgraph, you take a subset of nodes and include all edges among those nodes.
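- k-core: the largest node-induced subgraph where every node has degree at least k within that subgraph. Note: it is not enough to simply remove nodes with degree less than k! Removing a node lowers the degrees of its neighbors, so the removal has to be repeated until no node falls below k.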
- path: A path in a graph is a sequence of edges which join a sequence of vertices. e.g. \(\{(1,3), (3,10), (10,5)\}\) is a path from 1 to 5. \(\{(1,4), (5,7)\}\) is not a path because the first edge ends at node 4 and the second edge starts at node 5.
- weighted graph: Sometimes, each edge in the graph has a weight. This weight is typically presumed to be non-negative. For example, \((i,j)\) might indicate that \(i\) sent \(j\) a message. The weight might be how many messages. In this case, the adjacency matrix elements \(A_{ij}\) can be set to the weight of the edge \((i,j)\). In that notation, no edge is equivalent to edge weight zero (seems reasonable).
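The definitions above (adjacency matrix, degree, k-core) can be sketched in a few lines. This is a toy illustration in plain Python, not a tool for large graphs; the helper names `adjacency_matrix` and `k_core` and the five-node example graph are assumptions made for this sketch, and nodes are numbered from 0 rather than 1.

```python
def adjacency_matrix(n, edges):
    """0/1 adjacency matrix of an undirected graph on nodes 0..n-1."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1
        A[j][i] = 1  # undirected: (i,j) in E implies (j,i) in E
    return A

def k_core(A, k):
    """Node set of the k-core, found by repeatedly peeling low-degree nodes."""
    alive = set(range(len(A)))
    changed = True
    while changed:
        changed = False
        for i in list(alive):
            # Degree counted only among the nodes that are still present.
            degree_in_subgraph = sum(A[i][j] for j in alive)
            if degree_in_subgraph < k:
                alive.remove(i)
                changed = True
    return alive

# Hypothetical toy graph: a triangle 0-1-2 with a tail 2-3-4.
n = 5
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]
A = adjacency_matrix(n, edges)

degrees = [sum(row) for row in A]  # row sums of A
is_symmetric = all(A[i][j] == A[j][i] for i in range(n) for j in range(n))

# Dropping low-degree nodes once is not enough: node 3 survives the
# first pass but then has only one remaining neighbor.
one_pass = {i for i in range(n) if degrees[i] >= 2}
two_core = k_core(A, 2)
print(degrees, one_pass, two_core)
```

In this example a single pass keeps nodes 0 through 3, while the actual 2-core is only the triangle.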
Empirical regularities (network science)
- sparsity (graphs are big. most people are connected to only a small number of others.),
- reciprocity (if you are my friend, then I am your friend),
- heavy tailed degree distribution (e.g. rich get richer). Power law degree distributions? Or perhaps, just heavy tailed…
- core-periphery (densely connected “core” of nodes and a “periphery” of nodes that is weakly connected to core),
- transitivity/triangles (friends of my friends are likely to be my friends) and other motifs,
- homophily (friends are similar in lots of ways. “birds of a feather flock together”)
- low diameter / low average path length. For any two nodes in the graph, there is a short path between them. e.g. “6 degrees of separation”.
- “small world” is the combination of sparse, transitive, and low diameter.
- communities at various resolutions.
- Basic model: latent space model
- conditional on everyone’s latent features \(z_i\), friendships are independent with \(P(i - j|z_i, z_j) = p_{ij} = f(z_i,z_j)\), e.g. \(f(z_i,z_j) = \langle z_i, z_j\rangle\).
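A minimal sketch of sampling a graph from the latent space model, using the inner-product choice of \(f\). The function name `sample_latent_space_graph` and the toy latent features are made up for illustration. One assumption is needed that the notes do not state: the \(z_i\) must be chosen so that \(\langle z_i, z_j\rangle\) lies in \([0,1]\), here by using non-negative vectors with norm at most one.

```python
import random

def sample_latent_space_graph(z, seed=0):
    """Sample an undirected 0/1 adjacency matrix with P(i - j) = <z_i, z_j>.

    Assumes every inner product <z_i, z_j> lies in [0, 1].
    """
    rng = random.Random(seed)
    n = len(z)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            p_ij = sum(a * b for a, b in zip(z[i], z[j]))
            # Conditional on z, each friendship is an independent coin flip.
            if rng.random() < p_ij:
                A[i][j] = 1
                A[j][i] = 1
    return A

# Hypothetical latent features: two groups of three nodes each.
# Within-group probability is 0.82, between-group probability is 0.18.
z = [(0.9, 0.1)] * 3 + [(0.1, 0.9)] * 3
A = sample_latent_space_graph(z, seed=1)
for row in A:
    print(row)
```

With features like these, nodes in the same group are much more likely to be connected, which is one simple way to get communities out of the model.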
Tiny facts:
- \([A^q]_{ij}\) is the number of paths from \(i\) to \(j\) in \(q\) steps. Here a path may revisit nodes and edges (such paths are often called walks).
- Theorem: Suppose the graph is simple (unweighted, undirected, no self-loops). The sum of the eigenvalues of \(A\) is equal to zero. The sum of the squared eigenvalues is equal to twice the number of edges. The sum of the cubed eigenvalues is equal to six times the number of triangles. These follow because \(\sum_\ell \lambda_\ell^q = \mathrm{trace}(A^q)\), which counts the closed paths of length \(q\).
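These facts can be checked on a toy graph without computing any eigenvalues, since the sum of the \(q\)-th powers of the eigenvalues of \(A\) equals \(\mathrm{trace}(A^q)\). A small sketch in plain Python; the helpers `matmul` and `trace` and the four-node example graph are assumptions made for illustration.

```python
def matmul(X, Y):
    """Product of two square matrices stored as lists of lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def trace(X):
    """Sum of the diagonal entries."""
    return sum(X[i][i] for i in range(len(X)))

# Hypothetical toy graph: a triangle 0-1-2 plus one extra edge 2-3.
n = 4
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
A = [[0] * n for _ in range(n)]
for i, j in edges:
    A[i][j] = 1
    A[j][i] = 1

A2 = matmul(A, A)
A3 = matmul(A2, A)

# [A^2]_{03} counts the 2-step routes from node 0 to node 3 (only 0-2-3).
two_step_0_to_3 = A2[0][3]

num_edges = trace(A2) // 2      # trace(A^2) = 2 * (number of edges)
num_triangles = trace(A3) // 6  # trace(A^3) = 6 * (number of triangles)
print(trace(A), num_edges, num_triangles, two_step_0_to_3)
```

The factor of 2 appears because each edge gives two closed 2-step routes (one from each endpoint), and the factor of 6 because each triangle can be traversed from 3 starting nodes in 2 directions.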
Discuss first assignment: blog posts.
Blog posts