Unsupervised integrative analysis of multiple data sources has become commonplace, and scalable algorithms are necessary to accommodate the ever-increasing availability of data. Only a few current methods focus on estimation speed, and those that do are applicable only to restricted data layouts, such as different data types measured on the same observation units. We introduce a novel view of low-rank matrix integration, phrased as a graph estimation problem, which allows the development of a method, large-scale Collective Matrix Factorization (lsCMF), that can integrate data in flexible layouts quickly. It uses a matrix denoising framework for rank estimation and geometric properties of singular vectors to integrate data efficiently. Simulation studies demonstrate that lsCMF achieves fast estimation while retaining good recovery of the data structure.
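To illustrate what rank estimation through matrix denoising can look like, the following is a minimal sketch, not the lsCMF procedure itself. It swaps in the singular value hard threshold of Gavish and Donoho for a single matrix and assumes i.i.d. Gaussian noise with a known standard deviation `sigma`. The function name `hard_threshold_rank` and the simulated data are hypothetical and chosen only for this example.

```python
import numpy as np


def hard_threshold_rank(Y, sigma):
    """Estimate the rank of a noisy matrix by counting the singular values
    above the Gavish-Donoho hard threshold (noise level sigma assumed known).

    Illustrative stand-in only; this is not the estimator used by lsCMF.
    """
    m, n = sorted(Y.shape)  # m <= n
    beta = m / n            # aspect ratio in (0, 1]
    # Threshold coefficient lambda*(beta) for known noise level.
    lam = np.sqrt(
        2 * (beta + 1)
        + 8 * beta / ((beta + 1) + np.sqrt(beta**2 + 14 * beta + 1))
    )
    tau = lam * np.sqrt(n) * sigma
    singular_values = np.linalg.svd(Y, compute_uv=False)
    return int(np.sum(singular_values > tau))


# Simulated example: a rank-3 signal plus unit-variance Gaussian noise.
rng = np.random.default_rng(0)
m, n, true_rank = 200, 300, 3
U = rng.standard_normal((m, true_rank))
V = rng.standard_normal((n, true_rank))
signal = U @ V.T
noisy = signal + rng.standard_normal((m, n))

estimated_rank = hard_threshold_rank(noisy, sigma=1.0)
print(estimated_rank)
```

In this simulation the signal singular values lie far above the noise level, so the threshold separates them cleanly. With weaker signals or an unknown noise level, the estimate would depend on how `sigma` is obtained, which this sketch does not address.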