Many modern datasets consist of multiple related matrices measured on a common set of units, where the goal is to recover the shared low-dimensional subspace. While the Angle-based Joint and Individual Variation Explained (AJIVE) framework provides a solution, it relies on equal-weight aggregation, which can be strictly suboptimal when views exhibit significant statistical heterogeneity (arising from varying signal-to-noise ratios (SNR) and dimensions) and structural heterogeneity (arising from individual components). In this paper, we propose HeteroJIVE, a weighted two-stage spectral algorithm tailored to such heterogeneity. Theoretically, we first revisit the ``non-diminishing'' error barrier with respect to the number of views $K$ identified in recent literature for the equal-weight case. We demonstrate that this barrier is not universal: under generic geometric conditions, the bias term vanishes and our estimator achieves the $O(K^{-1/2})$ rate without the need for iterative refinement. Extending this analysis to the general-weight case, we establish error bounds that explicitly disentangle the two layers of heterogeneity. Based on these bounds, we derive an oracle-optimal weighting scheme implemented via a data-driven procedure. Extensive simulations corroborate our theoretical findings, and an application to TCGA-BRCA multi-omics data validates the superiority of HeteroJIVE in practice.
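To make the phrase "weighted two-stage spectral algorithm" concrete, the following is a minimal sketch of one generic construction of that kind, not the authors' implementation. It assumes that stage one takes the top left singular vectors of each view, and that stage two takes the leading eigenvectors of a weighted sum of the resulting projection matrices; equal weights correspond to AJIVE-style aggregation. The function names `weighted_joint_subspace` and `sin_theta_distance`, the per-view ranks, and the user-supplied `weights` are all hypothetical; the oracle-optimal weighting scheme mentioned in the abstract is not reproduced here.

```python
import numpy as np


def weighted_joint_subspace(views, ranks, joint_rank, weights=None):
    """Estimate a shared column subspace from several views (illustrative only).

    views      : list of (n x p_k) arrays measured on the same n units
    ranks      : assumed signal rank of each view (joint + individual)
    joint_rank : assumed dimension of the shared subspace
    weights    : nonnegative per-view weights (assumption: supplied by the
                 caller); None means equal weights
    """
    n = views[0].shape[0]
    if weights is None:
        weights = np.ones(len(views))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize so the weights sum to one

    # Stage 1: per-view SVD, keeping the top-r_k left singular vectors.
    aggregate = np.zeros((n, n))
    for X, r, w in zip(views, ranks, weights):
        U, _, _ = np.linalg.svd(X, full_matrices=False)
        U_r = U[:, :r]
        # Accumulate the weighted projection onto this view's signal subspace.
        aggregate += w * (U_r @ U_r.T)

    # Stage 2: leading eigenvectors of the weighted aggregate.
    # np.linalg.eigh returns eigenvalues in ascending order, so reverse.
    _, eigvecs = np.linalg.eigh(aggregate)
    return eigvecs[:, ::-1][:, :joint_rank]


def sin_theta_distance(U, U_hat):
    """Spectral sin-theta distance between two orthonormal bases of equal rank."""
    s = np.linalg.svd(U.T @ U_hat, compute_uv=False)
    return float(np.sqrt(max(0.0, 1.0 - s.min() ** 2)))
```

In this sketch, a direction shared by every view receives an eigenvalue close to one in the aggregate, while a direction that is individual to view $k$ contributes only about its weight $w_k$, which is what lets the second stage separate joint from individual structure.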