We study the problem of learning a partially observed matrix under the low-rank assumption in the presence of fully observed side information that depends linearly on the true underlying matrix. This problem is an important generalization of the Matrix Completion problem, a central problem in Statistics, Operations Research and Machine Learning that arises in applications such as recommendation systems, signal processing, system identification and image denoising. We formalize this problem as an optimization problem whose objective balances how well the reconstruction fits the observed entries against how well the reconstruction predicts the side information. We derive a mixed-projection reformulation of the resulting optimization problem and present a strong semidefinite cone relaxation. We design an efficient, scalable alternating direction method of multipliers algorithm that produces high-quality feasible solutions to the problem of interest. Our numerical results demonstrate that in the small rank regime ($k \leq 15$), our algorithm outputs solutions that achieve on average $79\%$ lower objective value and $90.1\%$ lower $\ell_2$ reconstruction error than the solutions returned by the best performing benchmark method on synthetic data. The runtime of our algorithm is competitive with and often superior to that of the benchmark methods. Our algorithm is able to solve problems with $n = 10000$ rows and $m = 10000$ columns in less than a minute. On large-scale real-world data, our algorithm produces solutions that achieve $67\%$ lower out-of-sample error than benchmark methods in $97\%$ less execution time.
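To make the trade-off in the objective concrete, the following is a minimal sketch of one plausible form such an objective could take. It is not the paper's exact formulation. It assumes a squared-error fit on the observed entries, a squared Frobenius-norm penalty on the side-information residual, a rank-$k$ factorization $X = UV^\top$, and a trade-off weight `lam`. The names `objective`, `U`, `V`, `alpha`, `mask` and `lam` are all hypothetical and chosen for illustration only.

```python
import numpy as np

def objective(U, V, alpha, A, mask, Y, lam):
    """Illustrative objective (an assumption, not the paper's formula).

    U (n x k), V (m x k): factors of the rank-k reconstruction X = U V^T.
    alpha (m x d): linear map from the matrix to the side information.
    A (n x m): data matrix; only entries where mask is True are observed.
    Y (n x d): fully observed side information.
    lam: weight trading off entry fit against side-information fit.
    """
    X = U @ V.T
    # Squared error on the observed entries only.
    fit = np.sum(mask * (X - A) ** 2)
    # Squared error of the linear prediction of the side information.
    side = np.sum((Y - X @ alpha) ** 2)
    return fit + lam * side

# Small synthetic instance: rank-2 matrix with noiseless linear side information.
rng = np.random.default_rng(0)
n, m, k, d = 6, 5, 2, 3
U_true = rng.standard_normal((n, k))
V_true = rng.standard_normal((m, k))
X_true = U_true @ V_true.T
alpha_true = rng.standard_normal((m, d))
Y = X_true @ alpha_true
mask = rng.random((n, m)) < 0.6          # roughly 60% of entries observed
A = np.where(mask, X_true, 0.0)          # unobserved entries are placeholders

value_at_truth = objective(U_true, V_true, alpha_true, A, mask, Y, lam=1.0)
value_perturbed = objective(U_true + 0.1, V_true, alpha_true, A, mask, Y, lam=1.0)
```

In this noiseless toy instance the true factors attain an objective value of zero, and perturbing them increases both terms. Under these assumptions, larger values of `lam` favor reconstructions that predict the side information well, while smaller values favor fidelity to the observed entries.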