This article focuses on covariance estimation for multi-study data. Popular approaches employ factor-analytic terms with shared and study-specific loadings that decompose the variance into (i) a shared low-rank component, (ii) study-specific low-rank components, and (iii) a diagonal term capturing idiosyncratic variability. Our proposed methodology estimates the latent factors via spectral decompositions, with a novel approach for separating shared and specific factors, and infers the factor loadings and residual variances via surrogate Bayesian regressions. The resulting posterior has a simple product form across outcomes, bypassing the need for Markov chain Monte Carlo sampling and facilitating parallelization. The proposed methodology has major advantages over current Bayesian competitors in terms of computational speed, scalability and stability while also having strong frequentist guarantees. The theory and methods also add to the rich literature on frequentist methods for factor models with shared and group-specific components of variation. The approximation error decreases as the sample size and the data dimension diverge, formalizing a blessing of dimensionality. We show favorable asymptotic properties, including central limit theorems for point estimators and posterior contraction, and excellent empirical performance in simulations. The methods are applied to integrate three studies on gene associations among immune cells.
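As a minimal sketch of the three-part decomposition described above, the snippet below assembles one covariance matrix per study from a shared low-rank term, a study-specific low-rank term, and a diagonal term. The notation (`Lambda` for shared loadings, `Gammas` for study-specific loadings, `Psis` for residual variances), the dimensions, and the helper `study_covariances` are illustrative assumptions, not taken from the paper, and the code only builds the model covariances; it does not implement the proposed spectral estimator or the surrogate Bayesian regressions.

```python
import numpy as np

def study_covariances(shared_loadings, specific_loadings, residual_variances):
    """Build one p x p covariance per study as shared + specific + diagonal.

    shared_loadings:    (p, k) array, common to all studies (assumed notation).
    specific_loadings:  list of (p, q_s) arrays, one per study.
    residual_variances: list of length-p arrays of idiosyncratic variances.
    """
    shared_part = shared_loadings @ shared_loadings.T  # (i) shared low-rank component
    covariances = []
    for gamma_s, psi_s in zip(specific_loadings, residual_variances):
        specific_part = gamma_s @ gamma_s.T            # (ii) study-specific low-rank component
        diagonal_part = np.diag(psi_s)                 # (iii) idiosyncratic variability
        covariances.append(shared_part + specific_part + diagonal_part)
    return covariances

# Hypothetical sizes: 20 outcomes, 3 shared factors, 2 specific factors, 3 studies.
rng = np.random.default_rng(0)
p, k, q, n_studies = 20, 3, 2, 3

Lambda = rng.normal(size=(p, k))
Gammas = [rng.normal(size=(p, q)) for _ in range(n_studies)]
Psis = [rng.uniform(0.5, 1.5, size=p) for _ in range(n_studies)]

Sigmas = study_covariances(Lambda, Gammas, Psis)
```

Under these assumptions every study shares the same term `Lambda @ Lambda.T`, so differences between the covariances of two studies come only from their specific loadings and residual variances.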