We study the problem of learning a partially observed matrix under a low-rank assumption in the presence of fully observed side information that depends linearly on the true underlying matrix. This problem is an important generalization of the Matrix Completion problem, a central problem in Statistics, Operations Research and Machine Learning that arises in applications such as recommendation systems, signal processing, system identification and image denoising. We formalize it as an optimization problem whose objective balances how well the reconstruction fits the observed entries against how well the reconstruction predicts the side information. We derive a mixed-projection reformulation of the resulting optimization problem and present a strong semidefinite cone relaxation. We design an efficient, scalable alternating direction method of multipliers algorithm that produces high-quality feasible solutions to the problem of interest. Our numerical results demonstrate that in the small-rank regime ($k \leq 10$), our algorithm outputs solutions that achieve on average $2.3\%$ lower objective value and $41\%$ lower $\ell_2$ reconstruction error than the solutions returned by the best-performing benchmark method on synthetic data. The runtime of our algorithm is competitive with, and often superior to, that of the benchmark methods. Our algorithm is able to solve problems with $n = 10000$ rows and $m = 10000$ columns in less than a minute. On large-scale real-world data, our algorithm produces solutions that achieve $67\%$ lower out-of-sample error than benchmark methods in $97\%$ less execution time.
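To make the trade-off in the objective concrete, the following is a minimal sketch of how such an objective might be evaluated. The abstract does not state the exact formulation, so everything here is an assumption made for illustration: the squared-error loss on observed entries, the squared-error side-information term, the linear map `alpha`, the weight `lam`, and the function name `objective`. The low-rank constraint and any regularization the full formulation may include are not modeled.

```python
def objective(X, A, observed, Y, alpha, lam):
    """Hypothetical objective: fit on observed entries + lam * side-info error.

    X        : candidate reconstruction (list of rows)
    A        : partially observed matrix (only entries in `observed` are used)
    observed : list of (i, j) index pairs that were observed
    Y        : fully observed side information, one row per row of X
    alpha    : assumed linear map, so that row i of Y is predicted by X[i] @ alpha
    lam      : assumed weight balancing the two terms
    """
    # Squared error on the observed entries only.
    fit = sum((X[i][j] - A[i][j]) ** 2 for (i, j) in observed)

    # Squared error between the side information and its linear prediction.
    d = len(alpha[0])
    pred = 0.0
    for i, row in enumerate(X):
        for c in range(d):
            y_hat = sum(row[j] * alpha[j][c] for j in range(len(row)))
            pred += (Y[i][c] - y_hat) ** 2

    return fit + lam * pred


# Toy data: a rank-1 matrix with two observed entries and side info Y = A @ alpha.
A = [[1.0, 2.0], [2.0, 4.0]]
observed = [(0, 0), (1, 1)]
alpha = [[1.0], [1.0]]
Y = [[3.0], [6.0]]

at_truth = objective(A, A, observed, Y, alpha, lam=0.5)
at_zero = objective([[0.0, 0.0], [0.0, 0.0]], A, observed, Y, alpha, lam=0.5)
```

In this toy setting the true matrix attains zero objective, while the all-zero reconstruction is penalized both for missing the observed entries and for failing to predict the side information. Increasing `lam` shifts the emphasis toward the side-information term.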