We investigate a general matrix factorization for deviance-based data losses, extending the ubiquitous singular value decomposition beyond squared error loss. While similar approaches have been explored before, our method leverages classical statistical methodology from generalized linear models (GLMs) and provides an efficient algorithm that is flexible enough to allow for structural zeros via entry weights. Moreover, by adapting results from GLM theory, we provide support for these decompositions by (i) showing strong consistency under the GLM setup, (ii) checking the adequacy of a chosen exponential family via a generalized Hosmer-Lemeshow test, and (iii) determining the rank of the decomposition via a maximum eigenvalue gap method. To further support our findings, we conduct simulation studies to assess robustness to decomposition assumptions, as well as extensive case studies using benchmark datasets from facial image recognition, natural language processing, network analysis, and biomedical studies. Our theoretical and empirical results indicate that the proposed decomposition is more flexible, general, and robust than similar methods, and can thus provide improved performance. To facilitate applications, we also provide an R package that offers efficient model fitting along with family and rank determination.
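To make the notion of a deviance-based loss with entry weights concrete, the following is a minimal sketch in Python, assuming the standard GLM unit deviances for the Gaussian and Poisson families. It is not the algorithm or the API of the accompanying R package; the function names `gaussian_deviance`, `poisson_deviance`, and `truncated_svd` are hypothetical and introduced only for illustration. The sketch shows that the Gaussian deviance reduces to squared error, so a truncated singular value decomposition attains the sum of the discarded squared singular values, and that a zero weight removes an entry (for example, a structural zero) from the loss.

```python
import numpy as np


def gaussian_deviance(x, mu, w=None):
    """Weighted Gaussian deviance: sum of w * (x - mu)^2 (squared error)."""
    x = np.asarray(x, dtype=float)
    w = np.ones_like(x) if w is None else np.asarray(w, dtype=float)
    return float(np.sum(w * (x - mu) ** 2))


def poisson_deviance(x, mu, w=None):
    """Weighted Poisson deviance: 2 * sum of w * (x*log(x/mu) - (x - mu))."""
    x = np.asarray(x, dtype=float)
    w = np.ones_like(x) if w is None else np.asarray(w, dtype=float)
    # Convention 0 * log(0) = 0: replace the ratio by 1 wherever x == 0.
    ratio = np.where(x > 0, x / mu, 1.0)
    unit = x * np.log(ratio) - (x - mu)
    return float(2.0 * np.sum(w * unit))


def truncated_svd(x, rank):
    """Rank-`rank` approximation of x from its singular value decomposition."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]


# Gaussian case: the deviance of the rank-2 SVD fit equals the sum of the
# squared singular values that were discarded.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
s = np.linalg.svd(x, compute_uv=False)
print(gaussian_deviance(x, truncated_svd(x, 2)), np.sum(s[2:] ** 2))

# Poisson case: entries with weight 0 (assumed here to mark structural
# zeros) do not contribute to the loss.
counts = rng.poisson(3.0, size=(6, 4)).astype(float)
mu = np.full(counts.shape, 3.0)
weights = np.ones_like(counts)
weights[0, :] = 0.0  # hypothetical: treat the first row as structural zeros
print(poisson_deviance(counts, mu), poisson_deviance(counts, mu, weights))
```

Only the loss is illustrated here. Minimizing a non-Gaussian deviance over low-rank factors generally requires an iterative fitting procedure, which this sketch does not attempt.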