Given a matrix \(A \in \mathbb{R}^{n \times d}\), Vintage Sparse PCA (vsp) is a way of estimating interpretable latent factors for the rows and columns of \(A\). It provides a unified way of estimating a broad class of multivariate models. The manuscript that proposes and studies this technique is here. This document is more of a tutorial.
vsp takes \(A\) and estimates three matrices, \(\hat Z\), \(\hat B\), and \(\hat Y\). The “row factors” are in a matrix \(\hat Z \in \mathbb{R}^{n \times k}\) and the “column factors” are in a matrix \(\hat Y \in \mathbb{R}^{d \times k}\). Then, there is a “middle matrix” \(\hat B\) that is \(k \times k\). With these matrices, \(A \approx \hat Z \hat B \hat Y^T\).
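To build intuition for where these three matrices come from, here is a rough sketch of the idea in base R: a rank-\(k\) SVD followed by a varimax rotation of the singular vectors. This is an illustration only, not the vsp package's exact implementation (which also handles centering, scaling, and sparse matrices carefully).

```r
# Sketch of the vsp idea: k-dimensional SVD, then varimax rotation.
set.seed(1)
n <- 100; d <- 40; k <- 3
A <- matrix(rnorm(n * d), n, d)

s  <- svd(A, nu = k, nv = k)
rz <- stats::varimax(s$u, normalize = FALSE)  # rotate left singular vectors
ry <- stats::varimax(s$v, normalize = FALSE)  # rotate right singular vectors

Z_hat <- s$u %*% rz$rotmat
Y_hat <- s$v %*% ry$rotmat
B_hat <- t(rz$rotmat) %*% diag(s$d[1:k]) %*% ry$rotmat

# The rotation matrices are orthogonal, so they cancel in the product:
# Z B Y^T equals the usual rank-k SVD approximation of A.
err <- max(abs(Z_hat %*% B_hat %*% t(Y_hat) -
               s$u %*% diag(s$d[1:k]) %*% t(s$v)))
err
```

The rotation is what makes the factors interpretable: it pushes \(\hat Z\) and \(\hat Y\) toward sparsity, while \(\hat B\) absorbs the singular values and the rotations.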
For example, if the rows of \(A\) form \(k\) different clusters, then the \(k\) columns of \(\hat Z\) will be “cluster indicators” (\(\hat Z_{ij}\) is large if \(i\) is in cluster \(j\)).
At first, the middle matrix \(\hat B\) can be confusing. So, from now on, we will presume that the primary focus is on the rows, and we will define \(\tilde Y^T = \hat B \hat Y^T\). Then, \(A \approx \hat Z \tilde Y^T\). While \(\hat Y\) might be sparse, \(\tilde Y\) might not be; that is ok.
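Collapsing \(\hat B\) into the column factors is one line of linear algebra: since \(\tilde Y^T = \hat B \hat Y^T\), we have \(\tilde Y = \hat Y \hat B^T\). A toy check with arbitrary matrices:

```r
# Toy illustration: given (hat) Z, B, Y, define tilde Y = Y B^T so that
# Z %*% t(Y_tilde) gives the same approximation as Z %*% B %*% t(Y).
set.seed(1)
Z <- matrix(rnorm(6 * 2), 6, 2)
B <- matrix(rnorm(2 * 2), 2, 2)
Y <- matrix(rnorm(5 * 2), 5, 2)

Y_tilde <- Y %*% t(B)   # because t(Y_tilde) = B %*% t(Y)

same <- all.equal(Z %*% B %*% t(Y), Z %*% t(Y_tilde))
same
```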
We will illustrate vsp with two examples. First, on the text of realdonaldtrump tweets. This example demonstrates one way of converting text into a (sparse) document-term matrix \(A\). Second, with a graph of citations among academic journals. This example demonstrates one way of converting a vertex set and edge list into a (sparse) adjacency matrix \(A\). Then, we will contextualize the clusters/factors with bff.
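To preview the second conversion, here is a minimal sketch of turning a vertex set and edge list into a sparse adjacency matrix with library(Matrix). The journal names and edges below are made-up toy data, not the actual citation data used later.

```r
# Toy sketch: vertex set + edge list -> sparse adjacency matrix A.
library(Matrix)

journals <- c("AoS", "JASA", "JRSSB", "Biometrika")  # hypothetical vertex set
edges <- data.frame(
  from = c("AoS", "JASA", "JRSSB"),
  to   = c("JASA", "JRSSB", "AoS")
)

A <- sparseMatrix(
  i = match(edges$from, journals),   # row index of each edge
  j = match(edges$to, journals),     # column index of each edge
  x = 1,
  dims = c(length(journals), length(journals)),
  dimnames = list(journals, journals)
)
A
```

A document-term matrix is built the same way: rows indexed by documents, columns indexed by terms, and `x` holding the word counts.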
These are “sparse” matrices because the vast majority of their entries are zero (this is because the graphs themselves are sparse!). We use library(Matrix) in R to represent sparse matrices. In short, it makes the algorithms really fast because it only “stores” the non-zero entries. We will say more about it later.
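A quick way to see the benefit of storing only the non-zero entries: compare a mostly-zero dense base-R matrix with its sparse Matrix representation.

```r
# A 1000 x 1000 matrix with only 100 non-zero entries.
library(Matrix)

dense <- matrix(0, 1000, 1000)
dense[sample(length(dense), 100)] <- 1
sparse <- Matrix(dense, sparse = TRUE)

nnzero(sparse)       # 100 stored entries
object.size(dense)   # roughly 8 MB for a million doubles
object.size(sparse)  # only a few KB, since just the non-zeros are kept
```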