Why does spectral clustering work?

Previous classes have illustrated that spectral clustering can find partitions in the network. Why does it work? There are three general types of arguments that justify spectral clustering. In fact, they are (roughly) the same three types of arguments that are used to justify ordinary least squares in linear regression. The qualification of “roughly” will be explained below.

Justifications for ordinary least squares

1) Justification under exact representation. Suppose that you have data points \((y_i, x_i) \in R \times R^p\) for \(i = 1, \dots, n\). Suppose that there exists a set of \(\beta_j\)’s such that \[y_i = \beta_0 + \sum_j \beta_j x_{ij}\] for all \(i\). Under this exact representation, we can compute these \(\beta_j\)’s by solving \[\min_b \sum_i \left(y_i - \big(b_0 + \sum_j b_j x_{ij}\big)\right)^2.\]
This is ordinary least squares (OLS). Ordinary least squares is a convex problem and it can be optimized quickly. If \(x_i\) contains \(p\) elements, then we can compute these \(p+1\) \(\beta_j\)’s in \(O(np^2)\) time.

In particular, define \(X \in R^{n \times (p+1)}\) to contain \((1,x_i) \in R^{p+1}\) in the \(i\)th row and \(Y\in R^n\) to contain \(y_i\) in the \(i\)th element; then the OLS solution solves the system of equations \[(X^TX) \beta = X^TY, \mbox{ often expressed as } \beta = (X^TX)^{-1}X^TY.\] Here, \(\beta \in R^{p+1}\) contains the elements that solve \(y_i = \beta_0 + \sum_j \beta_j x_{ij}\).
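As a quick illustration, here is a minimal numpy sketch (the data and variable names are hypothetical, not from the notes) in which the exact representation holds, so solving the normal equations recovers the \(\beta_j\)’s:

```python
import numpy as np

# Hypothetical data where an exact linear representation holds:
# y_i = beta_0 + sum_j beta_j x_ij for all i.
rng = np.random.default_rng(0)
n, p = 50, 3
X_raw = rng.normal(size=(n, p))
beta_true = np.array([1.0, 2.0, -1.0, 0.5])   # (beta_0, beta_1, ..., beta_p)

X = np.column_stack([np.ones(n), X_raw])      # prepend the intercept column
y = X @ beta_true                             # exact representation, no noise

# Solve (X^T X) beta = X^T y; this recovers beta_true up to round-off error.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(beta_hat, beta_true))       # True
```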

2) Justification with approximation and optimization. When \(n > p\), there will not always exist a vector \(\beta\) that satisfies \(y_i = \beta_0 + \sum_j \beta_j x_{ij}\) for all \(i\). In this more general setting, we instead “approximate” \(y_i\) with a function of \(x_i\) of the form \(b_0 + \sum_j b_j x_{ij}\), and we would like to choose the \(b_j\)’s to minimize the approximation error. One way of measuring the error is by the squared residual: \[\mbox{residual}_i^2 = (y_i -(b_0 + \sum_j b_j x_{ij}))^2.\] Given that we have \(n\) data points, we should make this residual small across all values of \(i\). One way is to simply add up the squared residuals across all \(i\). That is, we should choose the \(b_j\)’s to minimize \[\min_{b_0, b_1, \dots, b_p} \sum_i \mbox{residual}_i^2.\] Let \(\hat b_0, \hat b_1, \dots, \hat b_p\) be the arguments that attain the minimum. This is ordinary least squares and it allows us to approximate \(y_i\) with the function \(\hat b_0 + \sum_j \hat b_j x_{ij}\). Just as before, the \(\hat b\) can be computed in \(O(np^2)\) time by solving the system of linear equations \[(X^TX) \hat b = X^TY, \mbox{ often expressed as } \hat b = (X^TX)^{-1}X^TY.\]
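A similar sketch for the approximation setting, again with hypothetical noisy data; np.linalg.lstsq returns the \(\hat b\) that minimizes the sum of squared residuals.

```python
import numpy as np

# Hypothetical noisy data: no exact linear representation exists, so OLS
# returns the coefficients that minimize the sum of squared residuals.
rng = np.random.default_rng(1)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=n)

b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimizes sum_i residual_i^2
residuals = y - X @ b_hat
print("residual sum of squares:", np.sum(residuals**2))
```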

The above argument provides a justification for using a linear combination of the elements of \(x_i\) to approximate \(y_i\). This argument does not rely on random variables or any other type of probabilistic argument. The key bit is to write down an optimization problem with a loss function that measures the “goodness” of the approximation.

3) Justification with a statistical model. Suppose that \[Y = X \beta + \epsilon,\] where \(\epsilon \in R^n\) is a random vector of noise that satisfies \(E(\epsilon) = 0\) and \(Cov(\epsilon) = \sigma^2 I\). Here, \(\beta\) now serves the role of \(b\), but it is a “parameter” in this statistical model. The Gauss-Markov theorem says that the OLS estimator \(\hat \beta\) is the BLU estimator. This is an acronym for Best (i.e. smallest variance), Linear (i.e. \(\hat \beta = \gamma Y\) for some matrix \(\gamma\)), and Unbiased (i.e. \(E(\hat \beta) = \beta\)).

Moreover, if \(\epsilon \sim N(0, \sigma^2 I)\), then \(\hat \beta\) is the maximum likelihood estimator.
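Here is a small simulation sketch of the statistical-model justification, with hypothetical values for \(\beta\) and \(\sigma\): averaging \(\hat\beta\) over many simulated data sets illustrates the unbiasedness claim.

```python
import numpy as np

# Monte Carlo sketch: under Y = X beta + eps with E(eps) = 0, the OLS
# estimator is unbiased, so its average over many simulated data sets
# should be close to beta.
rng = np.random.default_rng(2)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([1.0, -2.0, 0.5])              # hypothetical "true" parameter

estimates = []
for _ in range(2000):
    eps = rng.normal(scale=1.0, size=n)        # Cov(eps) = sigma^2 I
    y = X @ beta + eps
    estimates.append(np.linalg.solve(X.T @ X, X.T @ y))

print(np.mean(estimates, axis=0))              # approximately (1.0, -2.0, 0.5)
```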

Justifications for spectral clustering

1) Justification under exact representation. Suppose that you have an unweighted and undirected graph which has \(k\) connected components. Let \(A\) be the adjacency matrix. Then spectral clustering with \(D^{-1/2}AD^{-1/2}\) will exactly reveal these \(k\) components. To see this, note that for any vector \(x \in R^n\), \[x^T(I - D^{-1/2}AD^{-1/2})x = \frac{1}{2} \sum_{(i,j) \in E}\left(\frac{x_i}{\sqrt{deg(i)}} - \frac{x_j}{\sqrt{deg(j)}}\right)^2.\] This proof is left as an exercise, or the reader can consult Proposition 3 in von Luxburg’s Tutorial on Spectral Clustering. This expression shows that \(I - D^{-1/2}AD^{-1/2}\) is positive semi-definite, with smallest eigenvalue zero. Suppose that \(S \subset V\) is a connected component. Define \(f_S\) to be a vector with the \(i\)th element equal to zero if \(i \not \in S\). Otherwise, if \(i \in S\), then the \(i\)th element is \(\sqrt{deg(i)}\). Because there are no edges between \(S\) and \(S^c\), every term in the sum above vanishes when \(x = f_S\), so \(f_S\) is an eigenvector of \(I - D^{-1/2}AD^{-1/2}\) with eigenvalue zero. This shows that if there are \(k\) connected components, then the multiplicity of the eigenvalue zero is (at least) \(k\). It is left as an exercise (or consult the tutorial) to show that the multiplicity is exactly \(k\). To conclude the argument, note that (1) an eigenvector of \(I - D^{-1/2}AD^{-1/2}\) is also an eigenvector of \(D^{-1/2}AD^{-1/2}\) and (2) if \(\lambda\) is the eigenvalue from \(I- D^{-1/2}AD^{-1/2}\), then the eigenvalue of \(D^{-1/2}AD^{-1/2}\) is \(1-\lambda\). So, if there are \(k\) connected components, then \(D^{-1/2}AD^{-1/2}\) has exactly \(k\) eigenvalues which are equal to one. Compute the \(n \times k\) eigenvector matrix (containing the columns \(f_S\) for each connected component). Normalize the rows. There are exactly \(k\) unique rows of this matrix and these rows are exactly the standard basis vectors. Running k-means on the rows will reveal these \(k\) clusters.
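Here is a minimal sketch of this argument on a toy graph with \(k = 2\) connected components (two disjoint triangles); the graph, and the use of scipy and scikit-learn, are my own choices for illustration.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

# Toy graph with k = 2 connected components (two disjoint triangles).
# Spectral clustering on D^{-1/2} A D^{-1/2} recovers the components exactly.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    A[i, j] = A[j, i] = 1

deg = A.sum(axis=1)
D_inv_sqrt = np.diag(1 / np.sqrt(deg))
M = D_inv_sqrt @ A @ D_inv_sqrt            # eigenvalue 1 of M <-> eigenvalue 0 of I - M

# Eigenvectors for the k largest eigenvalues of M (both equal to 1 here).
vals, vecs = eigh(M)                       # eigenvalues in ascending order
U = vecs[:, -2:]                           # n x k eigenvector matrix
U = U / np.linalg.norm(U, axis=1, keepdims=True)   # normalize the rows

labels = KMeans(n_clusters=2, n_init=10).fit_predict(U)
print(labels)                              # e.g. [0 0 0 1 1 1] (up to label swap)
```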

This argument shows that if there are \(k\) “perfect clusters” (i.e. \(k\) connected components), then spectral clustering will reveal these \(k\) clusters.

There is another possible “exact representation” for weighted graphs. Define the weighted adjacency matrix \(A \in R^{n \times n}\) to contain the weight of edge \((i,j)\) in the \(i,j\)th element. Suppose that the graph is such that there are \(k\) “blocks” of nodes, where each node \(i\) belongs to the block \(z(i) \in \{1, \dots, k\}\). Here \(z\) is a function that “labels” each node with its block. The weighted graph satisfies the “exact representation” if there is a “block weight” matrix \(B \in R^{k \times k}\) such that the weight of the edge between any two nodes \(i\) and \(j\) is exactly \[A_{ij} = B_{z(i), z(j)}.\] Define \(Z \in \{0,1\}^{n \times k}\) to have \(Z_{ij} = 1\) if \(z(i) = j\) and otherwise \(Z_{ij}=0\). Then, \[A = ZBZ^T,\] showing that the rank of \(A\) is no greater than \(k\). Under mild conditions on \(B\) and \(Z\) (i.e. full rank), \(rank(A) = k\). Again, under this exact representation, spectral clustering will exactly find the \(k\) blocks.
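A sketch of this weighted exact representation, with a hypothetical block weight matrix \(B\) and \(k = 2\) blocks; spectral clustering on the top \(k\) eigenvectors recovers the blocks.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

# Weighted "exact representation" A = Z B Z^T with k = 2 blocks.
# The block-weight matrix B and block sizes below are hypothetical.
z = np.array([0] * 5 + [1] * 5)                   # block label for each node
Z = np.eye(2)[z]                                  # n x k membership matrix
B = np.array([[0.9, 0.1],
              [0.1, 0.6]])                        # k x k block weight matrix
A = Z @ B @ Z.T                                   # weighted adjacency, rank k

print(np.linalg.matrix_rank(A))                   # 2

deg = A.sum(axis=1)
L = np.diag(deg**-0.5) @ A @ np.diag(deg**-0.5)
vals, vecs = eigh(L)
U = vecs[:, -2:]                                  # top-k eigenvectors
U = U / np.linalg.norm(U, axis=1, keepdims=True)
print(KMeans(n_clusters=2, n_init=10).fit_predict(U))   # recovers the two blocks
```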

2) Justification with approximation and optimization.
In their 2000 paper “Normalized Cuts and Image Segmentation,” Shi and Malik propose an objective function called “normalized cuts” to measure the goodness of a partition. For two sets of nodes \(S_1, S_2\) and the adjacency matrix \(A\), define \(A(S_1, S_2)\) to be the sum over the submatrix of \(A\) created by retaining the rows in \(S_1\) and the columns in \(S_2\). For example, if \(S_2 = S_1^c\), then \(A(S_1, S_2) = cut(S_1)\), i.e. the number of edges that must be cut to remove vertices \(S_1\) from the graph. Also, for the entire node set \(V\), \(A(S_1, V) = vol(S_1)\), i.e. the sum of the node degrees in set \(S_1\). For a partition defined by \(S\) and \(S^c\), the value of the normalized cut is \[ncut(S) = \frac{A(S,S^c)}{A(S,V)} + \frac{A(S,S^c)}{A(S^c,V)}.\] If spectral clustering could be used to exactly minimize \(ncut(S)\) over all partitions \(S\), then we would not need the qualification of “roughly” in the first paragraph. Unfortunately, \(ncut\) cannot be minimized in polynomial time.

To understand how we can try to optimize \(ncut\) over possible partitions \(S\), define \(b = A(S,V)/A(S^c,V)\). For a partition \(S\), define the vector \(y\in R^n\), where \(y_i = 1\) if \(i \in S\) and \(y_i = -b\) if \(i \in S^c\). Then, \[\min_S ncut(S) = \min_y \frac{y^T(D-A)y}{y^T D y}, \mbox{ subject to } y_i \in \{1, -b\} \mbox{ for all $i$ and } y^TD\textbf{1} = 0,\] where \(D\) is the diagonal degree matrix from spectral clustering and \(\textbf{1}\) is a vector of ones. Unfortunately, neither of these optimization problems is computationally tractable. However, if we relax the constraint \(y_i \in \{1, -b\}\) for all \(i\) to allow for any real valued entries, then, after the change of variables \(x = D^{1/2}y\), the optimization problem on the right hand side becomes a Rayleigh quotient, which is exactly optimized by the eigenvector of \(I -D^{-1/2}AD^{-1/2}\) with the second smallest eigenvalue. After computing this eigenvector and setting \(y = D^{-1/2}x \in R^n\), threshold the values so that \(i \in S\) if and only if \(y_i \le 0\).
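A sketch of this relaxation on a small hypothetical graph (two triangles joined by a single bridging edge): the second smallest eigenvector, thresholded at zero, cuts the bridge.

```python
import numpy as np
from scipy.linalg import eigh

# Two triangles joined by one edge. The second smallest eigenvector of
# I - D^{-1/2} A D^{-1/2}, thresholded at zero, splits the graph at the bridge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1

deg = A.sum(axis=1)
L_sym = np.eye(6) - np.diag(deg**-0.5) @ A @ np.diag(deg**-0.5)
vals, vecs = eigh(L_sym)                  # eigenvalues in ascending order
x = vecs[:, 1]                            # second smallest eigenvector
y = x / np.sqrt(deg)                      # y = D^{-1/2} x (same signs as x)
S = np.where(y <= 0)[0]                   # threshold at zero to get a partition
print(S, np.setdiff1d(np.arange(6), S))   # {0,1,2} vs {3,4,5} (or vice versa)
```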

The ncut objective is intuitively meaningful. However, there are several ways that it could be modified while still remaining intuitively meaningful. As such, there is nothing sacrosanct about the ncut objective function. One minor change would be to alter the denominator, as is done with the following measure of the “goodness” of a partition \(S \subset V\): \[h_G(S) = \frac{|A(S, S^c)|}{\min(A(S,V), A(S^c,V))}.\] When \(h_G(S)\) is small, there are a small number of edges between \(S\) and \(S^c\) relative to the total number of edges connected to nodes in \(S\) (or \(S^c\)). Define \[h_G = \min_S h_G(S),\] where the minimum is over all partitions. This is the Cheeger constant and it measures how well the graph can be partitioned. While \(h_G\) is NP-hard to compute, the Cheeger bound is \[2 h_G \ge \gamma \ge \frac{h_G^2}{2},\] where \(\gamma\) is the spectral gap of \(D^{-1/2}AD^{-1/2}\), i.e. \(1-\lambda_2\) where \(\lambda_2\) is the second largest eigenvalue of \(D^{-1/2}AD^{-1/2}\). The Cheeger bound gives an explicit relationship between the Cheeger constant (which measures the partitionability of \(G\)) and the eigenvalues of \(L_{sym}\). The proof of the Cheeger bound relies upon showing that thresholding the eigenvector provides a partition \(S\) for which \(h_G(S)\) has a small value.
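For a graph small enough to enumerate every partition, \(h_G\) can be computed by brute force and the Cheeger bound checked directly; here is a sketch on the same hypothetical graph as above.

```python
import numpy as np
from itertools import combinations
from scipy.linalg import eigh

# Brute-force the Cheeger constant h_G on a small graph and check the
# Cheeger bound 2*h_G >= gamma >= h_G^2 / 2, where gamma = 1 - lambda_2.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
n = 6
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1
deg = A.sum(axis=1)

def h(S):
    S = list(S)
    Sc = [i for i in range(n) if i not in S]
    cut = A[np.ix_(S, Sc)].sum()
    return cut / min(deg[S].sum(), deg[Sc].sum())

# Minimize h_G(S) over all nontrivial subsets S (feasible only for tiny graphs).
h_G = min(h(S) for r in range(1, n) for S in combinations(range(n), r))

lam = eigh(np.diag(deg**-0.5) @ A @ np.diag(deg**-0.5), eigvals_only=True)
gamma = 1 - lam[-2]                      # spectral gap: 1 minus the 2nd largest eigenvalue
print(2 * h_G >= gamma >= h_G**2 / 2)    # True
```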

These results justify spectral clustering as an optimization problem; just as with the first two justifications for OLS, they do not require a model for the underlying network. Instead, they show that the algorithm (i.e. OLS or spectral clustering) provides an approximation, and this is formalized by optimizing some objective function (e.g. sum of squared residuals, ncut, or \(h_G(S)\)).

3) Justification with a statistical model.

Suppose that the observed graph is a sample from a random graph model. Presume the sampled graphs are undirected and unweighted. Perhaps it is easiest to think of the graph with its adjacency matrix \(A\), which is now a random matrix.

The Stochastic Blockmodel (Holland et al, 1983) is one model for random graphs. Under this model, each node is assigned a value \(z(i) \in \{1, \dots, k\}\). These values are sampled iid from the multinomial\((\pi_1, \dots, \pi_k)\) distribution. Conditional on these assignments, the edges are independent and the probability of an edge between \(i\) and \(j\) depends only on \(z(i)\) and \(z(j)\). Using the “exact representation” for a weighted graph, \[E(A|z) = ZBZ^T\] for the \(k\times k\) matrix \(B\) that contains the probabilities. We can then study the leading eigenvectors of the “population” version of \(L_{sym}\), defined as \(\mathcal{L}_{sym} = E(D|z)^{-1/2} E(A|z) E(D|z)^{-1/2}\). By the arguments above, these “population” eigenvectors exactly reveal the block memberships. So, running k-means on the population eigenvectors reveals the true blocks. Of course, we do not observe \(E(A|z)\), only \(A\). Because \(A\) is random, the leading eigenvectors of \(L_{sym}\) are random vectors. Under certain conditions, these eigenvectors are close to the population eigenvectors. The basic argument bounds the spectral norm \(\|A - E(A|z)\|\) and extends this through to the eigenvectors using the Davis-Kahan theorem (e.g. see here). To bound the spectral norm, use the fact that \(A\) can be expressed as a sum of \(\binom{n}{2}\) independent matrices (one for each edge). Then, apply theorems that show sums of random matrices concentrate around their expectation. For example, see this paper.
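A simulation sketch of this argument, with hypothetical values of \(\pi\) and \(B\): sample \(A\) from an SBM, run spectral clustering on the leading eigenvectors of \(D^{-1/2}AD^{-1/2}\), and compare the estimated labels to the true blocks.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

# Sample an adjacency matrix from a Stochastic Blockmodel (hypothetical pi and
# B below), run spectral clustering, and compare to the true block labels.
rng = np.random.default_rng(3)
n, k = 300, 2
pi = np.array([0.5, 0.5])
B = np.array([[0.20, 0.05],
              [0.05, 0.20]])

z = rng.choice(k, size=n, p=pi)                   # block labels, iid multinomial
P = B[z][:, z]                                    # P_ij = B_{z(i), z(j)}
upper = np.triu(rng.random((n, n)) < P, 1)        # independent edges above the diagonal
A = (upper | upper.T).astype(float)               # symmetric, no self-loops

deg = A.sum(axis=1)                               # assumes no isolated nodes
L = np.diag(deg**-0.5) @ A @ np.diag(deg**-0.5)
vals, vecs = eigh(L)
U = vecs[:, -k:]                                  # leading k eigenvectors
U = U / np.linalg.norm(U, axis=1, keepdims=True)  # normalize the rows
labels = KMeans(n_clusters=k, n_init=10).fit_predict(U)

agreement = max(np.mean(labels == z), np.mean(labels != z))   # account for label swap
print(agreement)                                  # typically close to 1
```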

The Stochastic Blockmodel (SBM) has been generalized in several ways; the Degree-corrected SBM (DC-SBM) allows for heterogeneous degrees. Under this model, you can see why normalization of the rows of the eigenvector matrix is important. SCORE studies the eigenvectors of \(A\). We studied the regularized version of \(L_{sym}\) under the DC-SBM in this paper. Overlapping SBMs and Mixed Membership SBMs allow nodes to have multiple or partial memberships in different groups; as far as I know, these models have not yet been studied under spectral techniques. That doesn’t mean we haven’t tried! Happy to talk more about this in office hours.