In many practical settings, matrix factorization methods suffer from poor data quality, such as high sparsity and a low signal-to-noise ratio (SNR). Here, we consider a matrix factorization problem that uses auxiliary information, which is abundantly available in real-world applications, to overcome the challenges caused by poor data quality. Unlike existing methods, which mainly rely on simple linear models to combine auxiliary information with the main data matrix, we propose MFAI, which integrates gradient boosted trees into the probabilistic matrix factorization framework to leverage auxiliary information effectively. As a result, MFAI naturally inherits several salient features of gradient boosted trees, such as the ability to model nonlinear relationships flexibly and robustness to irrelevant features and missing values in the auxiliary information. The parameters of MFAI are determined automatically under the empirical Bayes framework, which makes the method adaptive in how much it uses the auxiliary information and immune to overfitting. Moreover, by exploiting variational inference, MFAI is computationally efficient and scalable to large datasets. We demonstrate the advantages of MFAI through comprehensive numerical results from simulation studies and real data analyses. Our approach is implemented in the R package mfair, available at https://github.com/YangLabHKUST/mfair.
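To make the construction concrete, one way to read the description above is as the following hierarchical model. This is a sketch under assumed notation, not the authors' stated formulation: the symbols $Y$, $X$, $Z$, $W$, $F_k$, $\beta_k$, and $\tau$, the rank $K$, and the specific Gaussian forms are all introduced here for illustration and do not appear in the abstract.

```latex
% Assumed notation: Y is the N x M main data matrix (possibly with missing
% entries), x_i is the auxiliary feature vector of row i, and K is the rank.
\begin{align}
  Y_{ij} &= \sum_{k=1}^{K} Z_{ik} W_{jk} + \epsilon_{ij},
    & \epsilon_{ij} &\sim \mathcal{N}\!\left(0, \tau^{-1}\right), \\
  Z_{ik} &\sim \mathcal{N}\!\left(F_k(x_i), \beta_k^{-1}\right),
    & W_{jk} &\sim \mathcal{N}(0, 1).
\end{align}
% F_k is an unknown function of the auxiliary features, fitted with
% gradient boosted trees rather than a linear map.
```

Under this reading, the prior mean $F_k(x_i)$ is where the auxiliary information enters, and the precision $\beta_k$ controls how tightly each factor is tied to it. Estimating $\beta_k$ and $\tau$ from the data, in the empirical Bayes sense, would then let the model rely on $F_k$ when the auxiliary features are informative and largely ignore them when they are not.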