Exponential Random Graph Models

Many network models are inspired by the desire to recreate certain realistic properties in the sampled graphs. “We know that graphs should have a long-tailed degree distribution; how do we put this into a model?” Or: “We know that graphs should have triangles; how do we put this into a model?”

Instead of inventing a new graph model for every new property, the exponential random graph model (ERGM) is a family of models that has the potential to recreate “any” property.

An introduction to the exponential family (think about real numbers/vectors)

The exponential family is interesting and important far outside of the domain of networks. It is one of the basic concepts covered in advanced mathematical-statistics courses. It is a set of probability distributions that can be expressed as \[f_X(x\mid\boldsymbol \theta) = h(x) \exp\left(\sum_{i=1}^s \eta_i({\boldsymbol \theta}) T_i(x) - A({\boldsymbol \theta}) \right)\] for some parameter vector \(\boldsymbol\theta\), “base measure” \(h(x)\), “natural parameter” \(\eta\), sufficient statistic \(T\), and log-partition function \(A\). Many (most?) familiar distributions belong to this family (e.g. Normal, Binomial, Poisson, Gamma, etc.).

Exponential families are used in fitting Generalized Linear Models (GLMs), in finding uniformly most powerful tests, and in obtaining conjugate priors. For example, in GLMs, each observation \(Y_i\) has its own natural parameter \(\eta_i\), and \(\eta_i = x_i'\beta\) (there is also a link function + other important details).
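
For a standard example, in Poisson regression with the canonical (log) link, \[Y_i \sim \mbox{Poisson}(\mu_i), \qquad \eta_i = \log \mu_i = x_i'\beta.\]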

The sufficient statistic(s) are important because they completely characterize the likelihood \(\prod_{\ell = 1}^N f_X(x_\ell \mid \theta)\) (this is a fun exercise for the exponential family if you haven’t done it!). The Rao-Blackwell theorem is helpful in emphasizing the importance of sufficient statistics, which matter beyond the exponential family. In general, it can be hard to find the sufficient statistics for a given distribution. However, if it belongs to an exponential family, then you have a recipe.
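
To see the first claim, write out the likelihood of an i.i.d. sample \(x_1, \dots, x_N\): \[\prod_{\ell = 1}^N f_X(x_\ell \mid \boldsymbol\theta) = \left(\prod_{\ell=1}^N h(x_\ell)\right) \exp\left(\sum_{i=1}^s \eta_i({\boldsymbol \theta}) \left[\sum_{\ell=1}^N T_i(x_\ell)\right] - N A({\boldsymbol \theta})\right),\] so, as a function of \(\boldsymbol\theta\), the likelihood depends on the data only through the sums \(\sum_{\ell} T_i(x_\ell)\).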

Pedagogically, when we begin studying the exponential family, we often start with a familiar distribution (e.g. Normal) and derive its \(h, \eta, T, A\); the same can be done for many other distributions. The exponential family is a generalization in the best sense: it is often easy to show things “in general” for all distributions in the family.
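
For example, writing the Normal density in this form: \[f_X(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) = \exp\left(\frac{\mu}{\sigma^2}\, x - \frac{1}{2\sigma^2}\, x^2 - \frac{\mu^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2)\right),\] so \(h(x) = 1\), \(\eta = (\mu/\sigma^2,\ -1/(2\sigma^2))\), \(T(x) = (x,\ x^2)\), and \(A = \mu^2/(2\sigma^2) + \frac{1}{2}\log(2\pi\sigma^2)\).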

The exponential family can be motivated in another way. It is the distribution with the largest entropy, subject to the constraint \(E(T(X)) = \alpha\) for some function \(T\):
\[\max_{P} \ \mbox{entropy}(P) \ \ \mbox{ subject to } \ E_P(T(X)) = \alpha.\] The solution to this problem is the exponential family with sufficient statistic \(T\); \(\eta\) is chosen to satisfy \(E(T(X)) = \alpha\). Read more here and here.
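
To sketch why (in the discrete case), form the Lagrangian with multipliers \(\eta_i\) for the moment constraints and \(\lambda\) for normalization: \[\mathcal{L}(P) = -\sum_x P(x)\log P(x) + \sum_{i=1}^s \eta_i\left(\sum_x P(x) T_i(x) - \alpha_i\right) + \lambda\left(\sum_x P(x) - 1\right).\] Setting the derivative with respect to each \(P(x)\) to zero gives \(-\log P(x) - 1 + \sum_i \eta_i T_i(x) + \lambda = 0\), i.e. \(P(x) \propto \exp\left(\sum_i \eta_i T_i(x)\right)\), which is the exponential family form.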

Networks as an exponential family

The Configuration Model says that we should sample uniformly from all graphs with the desired degree distribution. If we care about triangles, then we could “extend” this model to sample from all graphs with the desired degree distribution AND the desired number of triangles. This is a bit awkward/difficult to implement. If we only care about the degree distribution, Chung and Lu relaxed the Configuration Model into a model that has the same degrees in expectation.
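
Concretely, the Chung-Lu model connects each pair of nodes \(i, j\) independently with probability \(P(i \sim j) = d_i d_j / \sum_k d_k\) (assuming \(\max_i d_i^2 \le \sum_k d_k\), so that these are valid probabilities), which gives \[E[\mbox{degree}(i)] = \sum_{j \ne i} \frac{d_i d_j}{\sum_k d_k} = d_i\left(1 - \frac{d_i}{\sum_k d_k}\right) \approx d_i.\]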

How do we do this for triangles? How do we find a distribution that has the same expected number of triangles? Perhaps we should find the distribution with the largest entropy (i.e. the most “uncertainty”) that satisfies our desires in expectation. This is exactly what the ERGM does.

In ERGMs, the random variable is not “\(X \in \mathbb{R}\)”. Instead, it is a graph \(G\). ERGMs choose functions \(T\) to (for example) count the number of triangles, the number of \(k\)-stars, and any other motif of interest. Then, the graph is generated from the distribution with maximum entropy such that we match the observed number of triangles and \(k\)-stars (for example) in expectation.
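
Written in the same form as above, an ERGM puts the following probability on a graph \(g\) with \(n\) nodes: \[P(G = g \mid \boldsymbol\eta) = \exp\left(\sum_{i=1}^s \eta_i T_i(g) - A(\boldsymbol\eta)\right), \qquad A(\boldsymbol\eta) = \log \sum_{g'} \exp\left(\sum_{i=1}^s \eta_i T_i(g')\right),\] where the sum in \(A\) runs over all graphs \(g'\) on \(n\) nodes; for undirected simple graphs that is \(2^{\binom{n}{2}}\) terms, which previews problem 6 below.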

Here are some potential problems with ERGMs.
  1. Why should we care about the coefficients \(\eta\)? What do they mean? How do we interpret the estimated values? For the normal distribution, we like the parameterization using \(\mu\) and \(\sigma^2\); in the exponential family, \(\eta = (\mu/\sigma^2, -1/(2\sigma^2))\) and \(T = (x, x^2)\). The normal distribution is, in many ways, a nice distribution. Even with this “nice” example, it is not intuitively clear why the sufficient statistics \(T\) are the “most essential” thing about the distribution. Moreover, how do you interpret \(\eta\)?
  2. What is the generative model that leads to an ERGM? Without a generative model, how do we motivate and assess the plausibility of a model? Maximizing entropy is interesting, but it is not a process that describes how a network comes to be (in practice you need to use MCMC to generate one). In the context of RDS, Forrest Crawford has shown how you can “sample” an ERGM by sampling a much larger graph; this is a generative model for one version of an ERGM. It would be super interesting if this could be generalized to an ERGM with more general sufficient statistics.
  3. What if we do not believe that an ERGM actually generated the observed graph? Then, how do we interpret the “\(\eta\)’s”? Can the previous MLE theory be extended? Perhaps it is a trivial extension, but I don’t think so.
  4. You rarely observe the “entire” network; it is always a subnetwork. Estimates from random subgraphs do not extend to the population, even if you have a proper sample that is independent of the edges. More here.
  5. Suppose you estimate \(\beta\) in a linear model with 100 samples or with 1000 samples; you still interpret the estimates of \(\beta\) in the same way (perhaps with more or less uncertainty). This is not the case for ERGMs. Suppose a graph with 100 nodes and another graph with 1000 nodes are fit with the same ERGM. The parameter estimates are not comparable. This is related to the difficulty of interpreting the \(\eta\) coefficients.
  6. If you could compute the log-partition function \(A\), then it would be straightforward to estimate \(\eta\). In general, it is exceedingly hard to compute \(A\). Fitting the models requires MCMC (see the sketch after this list); when do you stop your chain? When has it converged? What if it “never” converges? (if link dead, it is cited in .Rmd comments.)
  7. The popular packages that fit ERGMs produce p-values. I do not think that there is “frequentist” theory to support these p-values.
  8. Degeneracy: The sampled graphs can be exceedingly sensitive to small changes in parameter values; for some values, you (almost surely) get an empty graph and for others, you (almost surely) get a complete graph. This makes MCMC routines difficult to implement.
  9. If you only have one graph, then your sample size is one.
  10. It’s super interesting that exponential families “maximize the entropy”, but why is this “the right thing to do”? Why is that our criterion?
  11. How do you choose your sufficient statistics? Suppose you are interpreting the coefficient for triangles. Then, you fit another model that includes 4-cliques. How does your interpretation change?
  12. Parameters \(\eta\) are “global” parameters; often we are more interested in the individuals.
  13. Whenever you are struggling to find motivation for your work, say “it is natural to study” and/or “there has been growing interest in [model/algorithm] for [exceedingly broad discipline]”. I’ve said it too. It works. As a reader, beware.
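
To make problems 6–8 concrete, here is a minimal sketch using the ergm package from the statnet suite (assuming it is installed). It fits the textbook “edges + triangle” model to the Florentine marriage network that ships with the package, prints the estimated \(\eta\)’s (and the p-values of problem 7), and runs the MCMC diagnostics of problem 6; simulating from other coefficient values is one way to see the degeneracy of problem 8.

```r
# A minimal sketch with the 'ergm' package (install.packages("ergm")).
library(ergm)

# The Florentine marriage network ships with the package.
data(florentine)  # loads 'flomarriage' (and 'flobusiness')

# Sufficient statistics T(g): the number of edges and the number of
# triangles. The triangle term induces dyad dependence, so the MLE of
# eta is approximated with MCMC (problem 6).
fit <- ergm(flomarriage ~ edges + triangle)

# Estimated eta's, standard errors, and the p-values of problem 7.
summary(fit)

# Trace plots for the chain: when has it converged (problem 6)?
mcmc.diagnostics(fit)

# Simulate graphs from the fitted model and compare their statistics
# to the observed ones. For other values of eta, the simulated graphs
# can collapse to (near-)empty or (near-)complete graphs (problem 8).
sims <- simulate(fit, nsim = 10)
summary(sims[[1]] ~ edges + triangle)
summary(flomarriage ~ edges + triangle)
```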

You could ask: if a model has been around for so long and there are so many problems, why are we still studying it? Alternatively, you could ask how we could improve on these points.