When synthesizing multi-source high-dimensional data, a key objective is to extract low-dimensional representations that effectively approximate the original features across different sources. Such representations facilitate the discovery of transferable structure and help mitigate systematic biases such as batch effects. We introduce Stable Principal Component Analysis (StablePCA), a distributionally robust framework for constructing stable latent representations by maximizing the worst-case explained variance over multiple sources. A primary challenge in extending classical PCA to the multi-source setting lies in the nonconvex rank constraint, which renders the StablePCA formulation a nonconvex optimization problem. To overcome this challenge, we derive a convex relaxation of StablePCA and develop an efficient Mirror-Prox algorithm that solves the relaxed problem with global convergence guarantees. Since the relaxed problem generally differs from the original formulation, we further introduce a data-dependent certificate to assess how well the algorithm solves the original nonconvex problem, and we establish conditions under which the relaxation is tight. Finally, we explore alternative distributionally robust formulations of multi-source PCA based on different loss functions.
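To make the objective concrete, the following minimal sketch evaluates the worst-case explained variance described above for a candidate subspace: each source contributes the variance it retains under projection onto an orthonormal basis, and the criterion takes the minimum across sources. This is an illustrative evaluation of the objective only, not the StablePCA optimization procedure; the function name and interface are our own.

```python
import numpy as np

def worst_case_explained_variance(V, sources):
    """Worst-case (minimum over sources) variance explained by the
    subspace spanned by the columns of V.

    V       : (d, k) matrix with orthonormal columns.
    sources : list of (n_g, d) data matrices, one per source.
    """
    vals = []
    for X in sources:
        Xc = X - X.mean(axis=0)            # center each source separately
        Sigma = Xc.T @ Xc / len(Xc)        # per-source covariance estimate
        vals.append(np.trace(V.T @ Sigma @ V))  # explained variance tr(V' Sigma V)
    return min(vals)                       # worst case across sources
```

A subspace maximizing this criterion is "stable" in the sense that no single source's structure is sacrificed: the minimum, rather than the pooled average, of the per-source explained variances is driven up.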