Principal component analysis (PCA) is a tool for capturing the factors that explain variation in data. In many domains, data are now collected across multiple contexts (for example, individuals with different diseases, cells of different types, or words across texts). While the factors explaining variation in data are undoubtedly shared across subsets of contexts, no tools currently exist to systematically recover such factors. We develop multi-context principal component analysis (MCPCA), a theoretical and algorithmic framework that decomposes data into factors shared across subsets of contexts. Applied to gene expression, MCPCA reveals axes of variation shared across subsets of cancer types, as well as an axis whose variability in tumor cells, but not its mean, is associated with lung cancer progression. Applied to contextualized word embeddings from language models, MCPCA maps stages of a debate on human nature, revealing a discussion between science and fiction spanning decades. These axes are not found by combining data across contexts or by restricting to individual contexts. MCPCA is a principled generalization of PCA that addresses the challenge of understanding the factors underlying data across contexts.
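For orientation, the two baselines the abstract contrasts MCPCA with (PCA on data combined across contexts, and PCA restricted to each context) can be sketched as below. This is a minimal illustration and not the MCPCA algorithm, which the abstract does not specify. The function names, the synthetic data, and the choice of three contexts with one direction shared by two of them are all assumptions made for the example.

```python
import numpy as np


def pca_axes(X, k):
    """Return the top-k principal axes (as rows) and their variances.

    X is a (samples x features) array; it is centered before the SVD.
    """
    Xc = X - X.mean(axis=0, keepdims=True)
    # Right singular vectors of the centered data are the principal axes.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    variances = s**2 / (X.shape[0] - 1)
    return Vt[:k], variances[:k]


def pooled_pca(contexts, k):
    """Baseline 1: stack all contexts and run a single PCA."""
    return pca_axes(np.vstack(contexts), k)


def per_context_pca(contexts, k):
    """Baseline 2: run PCA separately within each context."""
    return [pca_axes(X, k) for X in contexts]


# Hypothetical synthetic data: three contexts in 5 dimensions, where the
# first coordinate carries extra variance in contexts 0 and 1 only.
rng = np.random.default_rng(0)
d = 5
shared = np.zeros(d)
shared[0] = 1.0

contexts = []
for has_shared in (True, True, False):
    X = rng.normal(size=(200, d))
    if has_shared:
        X = X + 3.0 * rng.normal(size=(200, 1)) * shared
    contexts.append(X)

pooled_axes, pooled_var = pooled_pca(contexts, k=2)
local = per_context_pca(contexts, k=2)

# Absolute alignment of each context's leading axis with the shared direction.
alignment = [abs(axes[0] @ shared) for axes, _ in local]
```

Neither baseline reports which subset of contexts a given axis belongs to: the pooled fit returns a single set of axes, and the per-context fits return separate sets that must be compared after the fact. That is the gap the abstract says MCPCA is designed to fill.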