In practice, matrix factorization methods often suffer from poor data quality, such as high sparsity and a low signal-to-noise ratio (SNR). Here, we address matrix factorization with auxiliary information, which is abundantly available in real-world applications, to overcome the challenges caused by poor data quality. Unlike existing methods, which mainly rely on simple linear models to combine auxiliary information with the main data matrix, we propose MFAI, which integrates gradient boosted trees into the probabilistic matrix factorization framework to leverage auxiliary information effectively. MFAI thus inherits several salient features of gradient boosted trees, such as the ability to model nonlinear relationships flexibly and robustness to irrelevant features and missing values in the auxiliary information. The parameters of MFAI are determined automatically under the empirical Bayes framework, so the model adapts its use of auxiliary information and is immune to overfitting. Moreover, by exploiting variational inference, MFAI is computationally efficient and scalable to large datasets. We demonstrate the advantages of MFAI through comprehensive numerical results from simulation studies and real data analyses. Our approach is implemented in the R package mfair, available at https://github.com/YangLabHKUST/mfair.