One-Sided Matrix Completion from Ultra-Sparse Samples

Matrix completion is a classical problem that has received recurring interest across a wide range of fields. In this paper, we revisit this problem in an ultra-sparse sampling regime, where each entry of an unknown $n \times d$ matrix $M$ (with $n \ge d$) is observed independently with probability $p = C / d$, for a fixed integer $C \ge 2$. This setting is motivated by applications involving large, sparse panel datasets, where the number of rows far exceeds the number of columns. When each row contains only $C$ entries, fewer than the rank of $M$, accurate imputation of $M$ is impossible. Instead, we estimate the row span of $M$, or equivalently the averaged second-moment matrix $T = M^{\top} M / n$. The empirical second-moment matrix computed from the observed entries exhibits non-random and sparse missingness. We propose an unbiased estimator that normalizes each nonzero entry of the second moment by its observed frequency, and then apply gradient descent to impute the missing entries of $T$. The normalization divides a weighted sum of $n$ binomial random variables by the total number of ones. We show that the estimator is unbiased for any $p$ and enjoys low variance. When the row vectors of $M$ are drawn uniformly from a rank-$r$ factor model satisfying an incoherence condition, we prove that if $n \ge O(d r^5 \epsilon^{-2} C^{-2} \log d)$, any local minimum of the gradient-descent objective is approximately global and recovers $T$ with error at most $\epsilon^2$. Experiments on both synthetic and real-world data validate our approach. On three MovieLens datasets, our algorithm reduces bias by $88\%$ relative to baseline estimators. On synthetic data, we also empirically validate that the required number of rows $n$ scales linearly with $d$. On an Amazon reviews dataset with sparsity $10^{-7}$, our method reduces the recovery error of $T$ by $59\%$ and that of $M$ by $38\%$ compared to baseline methods.
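To make the normalization step concrete, the following is a minimal NumPy sketch of one plausible reading of it: each entry of the second moment is summed over the rows in which both columns are observed, then divided by the number of such rows. The function name `normalized_second_moment`, the Gaussian factor model, and the sizes `n`, `d`, `r`, `C` are illustrative assumptions, not taken from the paper, and the gradient-descent imputation of entries with no co-observations is not shown.

```python
import numpy as np

def normalized_second_moment(M_obs, mask):
    """Hypothetical helper: second-moment estimate normalized per entry.

    M_obs : (n, d) array; values outside `mask` are ignored.
    mask  : (n, d) boolean array, True where an entry is observed.
    Returns (T_hat, counts); T_hat[j, k] is NaN if columns j and k
    are never observed together in any row.
    """
    X = np.where(mask, M_obs, 0.0)   # zero out unobserved entries
    B = mask.astype(float)
    num = X.T @ X                    # sum over rows of M[i, j] * M[i, k], co-observed only
    counts = B.T @ B                 # number of rows where j and k are both observed
    T_hat = np.full(num.shape, np.nan)
    seen = counts > 0
    T_hat[seen] = num[seen] / counts[seen]
    return T_hat, counts

# Synthetic check under assumed sizes (illustrative only).
rng = np.random.default_rng(0)
n, d, r, C = 20000, 20, 3, 4
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))  # rank-r matrix
mask = rng.random((n, d)) < C / d                              # p = C / d
T = M.T @ M / n
T_hat, counts = normalized_second_moment(M, mask)
rel_err = np.linalg.norm(T_hat - T) / np.linalg.norm(T)
print(f"relative Frobenius error: {rel_err:.3f}")
```

Dividing by the per-entry co-observation count, rather than by a single global factor such as $n p^2$, is the design choice this sketch is meant to illustrate; with every entry observed, the estimate reduces exactly to $T = M^{\top} M / n$.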
