When synthesizing multi-source high-dimensional data, a key objective is to extract low-dimensional representations that effectively approximate the original features across different sources. Such representations facilitate the discovery of transferable structures and help mitigate systematic biases such as batch effects. We introduce Stable Principal Component Analysis (StablePCA), a distributionally robust framework that constructs stable latent representations by maximizing the worst-case explained variance across sources. A primary challenge in extending classical PCA to the multi-source setting lies in the nonconvex rank constraint, which renders the StablePCA formulation a nonconvex optimization problem. To overcome this challenge, we derive a convex relaxation of StablePCA and develop an efficient Mirror-Prox algorithm, with global convergence guarantees, to solve the relaxed problem. Since the relaxed problem generally differs from the original formulation, we further introduce a data-dependent certificate that assesses how well the algorithm solves the original nonconvex problem, and we establish a condition under which the relaxation is tight. Finally, we explore alternative distributionally robust formulations of multi-source PCA based on different loss functions.
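To make the formulation concrete, a natural instantiation (a sketch under our own assumptions; the paper's exact objective and notation may differ) measures the explained variance of a rank-$K$ projection $P$ on source $g$ through its empirical covariance $\widehat{\Sigma}^{(g)}$, and relaxes the set of rank-$K$ projections to its convex hull, the Fantope:
\[
\max_{P \in \mathcal{P}_K}\ \min_{1 \le g \le G}\ \operatorname{tr}\bigl(P\,\widehat{\Sigma}^{(g)}\bigr),
\qquad
\mathcal{P}_K = \bigl\{VV^\top : V \in \mathbb{R}^{d\times K},\ V^\top V = I_K\bigr\},
\]
\[
\text{relaxed over}\quad
\mathcal{F}_K = \bigl\{P \in \mathbb{S}^{d} : 0 \preceq P \preceq I_d,\ \operatorname{tr}(P) = K\bigr\}.
\]
After introducing simplex weights $q$ over sources, the relaxed problem becomes a concave-convex saddle point, $\max_{P \in \mathcal{F}_K} \min_{q \in \Delta_G} \sum_g q_g \operatorname{tr}(P\,\widehat{\Sigma}^{(g)})$, which is the form Mirror-Prox is designed for. The sketch below is our illustration of such a scheme, not the paper's implementation: it pairs a Euclidean mirror map with Fantope projection for $P$ and an entropy mirror map for $q$.

```python
import numpy as np

def fantope_proj(A, k):
    """Euclidean projection of a symmetric matrix A onto the Fantope
    {P : 0 <= P <= I, tr(P) = k}; reduces to projecting the eigenvalues
    onto the capped simplex via bisection on a shift theta."""
    lam, U = np.linalg.eigh(A)
    lo, hi = lam.min() - 1.0, lam.max()
    for _ in range(60):
        theta = 0.5 * (lo + hi)
        if np.clip(lam - theta, 0.0, 1.0).sum() > k:
            lo = theta  # too much mass: shift eigenvalues down further
        else:
            hi = theta
    x = np.clip(lam - 0.5 * (lo + hi), 0.0, 1.0)
    return (U * x) @ U.T

def mirror_prox(covs, k, eta=0.05, n_iter=1000):
    """Extragradient (Mirror-Prox) iterations for
    max_{P in Fantope} min_{q in simplex} sum_g q[g] * tr(P @ covs[g])."""
    G, d = len(covs), covs[0].shape[0]
    P = (k / d) * np.eye(d)          # feasible start in the Fantope
    q = np.full(G, 1.0 / G)          # uniform source weights
    for _ in range(n_iter):
        # gradients at the current point (ascent in P, descent in q)
        gP = sum(w * S for w, S in zip(q, covs))
        gq = np.array([np.trace(P @ S) for S in covs])
        # prediction step
        P_h = fantope_proj(P + eta * gP, k)
        q_h = q * np.exp(-eta * gq); q_h /= q_h.sum()
        # gradients at the predicted point
        gP = sum(w * S for w, S in zip(q_h, covs))
        gq = np.array([np.trace(P_h @ S) for S in covs])
        # correction step, taken from the original point
        P = fantope_proj(P + eta * gP, k)
        q = q * np.exp(-eta * gq); q /= q.sum()
    return P, q
```

A rounded rank-$K$ solution can be read off from the top-$K$ eigenvectors of the returned $P$, and comparing its worst-case explained variance with the relaxed optimum $\min_g \operatorname{tr}(P\,\widehat{\Sigma}^{(g)})$ yields a data-dependent gap in the spirit of the certificate described above; when the relaxed solution is itself rank $K$, the gap vanishes and the relaxation is tight.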