Sometimes the network is a “measurement device”

You should know about the Netflix Prize. It is an example where “the network” is interesting, but not of fundamental interest in itself. Instead, something happens on the network: two units interact and create a value (an affinity). Given observations on the network, we want to suggest two units that should interact because they are likely to have a high affinity.

In the Netflix Prize, the network is bipartite, {people \(\times\) movies}, and the affinities are subjective movie ratings. There are other examples.

I think of this problem as a type of regression problem. It is often referred to as “low-rank matrix estimation”. Here is the problem setup from this paper, not because it is canonical, but because I have the .tex code:

Problem setup

The underlying matrix that we wish to estimate is an \(n \times d\) matrix \(M_0\) with rank \(k\). By singular value decomposition (SVD), \[M_0 = U \Lambda V^T,\] for orthonormal matrices \(U = (U_1,\dots,U_k) \in R^{n \times k} \ \mbox{ and } \ V = (V_1,\dots,V_k) \in R^{d \times k}\) containing the left and right singular vectors, and a diagonal matrix \(\Lambda = \mbox{diag}(\Lambda_1, \ldots, \Lambda_k) \in R^{k \times k}\) containing the singular values. \(M_0\) is corrupted by noise \(\epsilon \in R^{n \times d}\), where the entries of \(\epsilon\) are i.i.d. sub-Gaussian random variables with mean zero and variance \(\sigma^2\). Let \(y \in \{0,1\}^{n \times d}\) be such that \(y_{ij} = 1\) if the \((i,j)\)-th entry of \(M_0+\epsilon\) is observed and \(y_{ij} = 0\) if it is not observed. Perhaps the entries of \(y\) are i.i.d. Bernoulli(\(p\)) and independent of the entries of \(\epsilon\). We observe \(y\) and the partially observed matrix \(M \in R^{n\times d}\), \[ M_{ij} = \big[y \cdot ({M_0} + \epsilon)\big]_{ij} = \begin{cases} {M_0}_{ij}+\epsilon_{ij} & \text{if observed ($y_{ij}=1$)}\\ 0 & \text{otherwise ($y_{ij}=0$)} \end{cases} \] for \(1\le i\le n\) and \(1\le j\le d\). We wish to estimate \(M_0\), or \(U,V,\) and \(\Lambda\). It is presumed that \(k \ll n\).
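
To make the notation concrete, here is a small simulation of this setup. The dimensions, the Bernoulli \(p\), the noise level, and the variable names are arbitrary choices of mine, not values from the paper:

```r
# Simulate the setup above: a rank-k matrix M0, Gaussian (hence sub-Gaussian) noise,
# and i.i.d. Bernoulli(p) observation indicators.
set.seed(1)
n <- 500; d <- 200; k <- 3
p <- 0.2        # probability that an entry is observed
sigma <- 1      # noise standard deviation

U      <- qr.Q(qr(matrix(rnorm(n * k), n, k)))             # orthonormal left singular vectors
V      <- qr.Q(qr(matrix(rnorm(d * k), d, k)))             # orthonormal right singular vectors
Lambda <- diag(sort(runif(k, 10, 50), decreasing = TRUE))  # singular values
M0     <- U %*% Lambda %*% t(V)                            # rank-k signal

eps <- matrix(rnorm(n * d, sd = sigma), n, d)   # i.i.d. mean-zero noise
y   <- matrix(rbinom(n * d, 1, p), n, d)        # observation indicators
M   <- y * (M0 + eps)                           # zeros where unobserved
```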

Standard estimator

Soft-impute is probably the most popular algorithm. It is implemented in the R package softImpute.
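
Continuing with the simulated \(M\), \(y\), and \(k\) from above, a fit with that package looks roughly like this. The calls to softImpute() and complete() reflect my understanding of the package interface, and lambda = 10 is an arbitrary choice (in practice you would tune it), so check the package documentation:

```r
# Sketch of a fit with the softImpute package; it expects NAs, not zeros,
# at the unobserved entries.
library(softImpute)

M_na <- M
M_na[y == 0] <- NA            # mark unobserved entries as missing

fit  <- softImpute(M_na, rank.max = k, lambda = 10)  # nuclear-norm-penalized fit
Mhat <- complete(M_na, fit)                          # fill in the missing entries
```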

Recall that for a given matrix \(M\), the rank \(k\) truncated SVD solves the following optimization problem: \[\arg \min_{\hat M \mbox{ rank $k$}} \sum_{ij}(M_{ij} - \hat M_{ij})^2.\] In the matrix completion problem, the following loss function makes more sense (notice the index on the summation): \[\arg \min_{\hat M \mbox{ rank $k$}} \sum_{ij: y_{ij} = 1}(M_{ij} - \hat M_{ij})^2.\] Unfortunately, we cannot solve this problem in polynomial time. It can be “relaxed” as follows: \[\arg \min_{\hat M} \sum_{ij: y_{ij} = 1}(M_{ij} - \hat M_{ij})^2 + \lambda \sum_{\ell} \sigma_\ell(\hat M),\] where \(\sigma_\ell(\hat M)\) is the \(\ell\)th singular value of \(\hat M\). The sum of the singular values is called the nuclear norm, and this nuclear norm penalty has become very popular in statistics.

The relaxed problem is convex and it can be solved via soft-impute (see link above). The basic idea is to initialize with the missing elements equal to zero, compute the SVD of the filled-in matrix, and then impute the missing values \((i,j)\) from the estimated singular vectors and singular values; you then iterate these steps. In fact, if \(U\), \(\Lambda\), and \(V\) are the estimated singular vectors and values (on a certain step), then the updated values are \(U\, t(\Lambda, \lambda)\, V^T\), where \(t(\cdot, \lambda)\) is the soft-thresholding function applied elementwise to the singular values.
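
To make the update concrete, here is a bare-bones sketch of that iteration, written from the description above rather than taken from the softImpute package; the function name soft_impute and the choices lambda = 10 and n_iter = 100 are mine:

```r
# Bare-bones soft-impute: alternate between soft-thresholding the singular
# values of the filled-in matrix and re-imputing the unobserved entries.
soft_impute <- function(M, y, lambda, n_iter = 100) {
  filled <- M                                   # start with zeros at the unobserved entries
  for (iter in seq_len(n_iter)) {
    s      <- svd(filled)                       # SVD of the current filled-in matrix
    d_thr  <- pmax(s$d - lambda, 0)             # soft-threshold the singular values
    Z      <- s$u %*% diag(d_thr) %*% t(s$v)    # U t(Lambda, lambda) V^T
    filled <- y * M + (1 - y) * Z               # keep observed entries, impute the rest
  }
  Z                                             # low-rank estimate of M0
}

Mhat_sketch <- soft_impute(M, y, lambda = 10)
```

Each pass re-estimates the low-rank structure from the filled-in matrix and then re-imputes the missing entries; the softImpute package does this far more efficiently by exploiting the sparse-plus-low-rank structure of the filled-in matrix, so this sketch is only for intuition.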