Natural systems with emergent behaviors often organize along low-dimensional subsets of high-dimensional spaces. For example, despite the tens of thousands of genes in the human genome, the principled study of genomics is fruitful because biological processes rely on coordinated organization that results in lower-dimensional phenotypes. To uncover this organization, many nonlinear dimensionality reduction techniques have successfully embedded high-dimensional data into low-dimensional spaces by preserving local similarities between data points. However, the nonlinearities in these methods allow for too much curvature to preserve general trends across multiple non-neighboring data clusters, thereby limiting their interpretability and generalizability to out-of-distribution data. Here, we address both of these limitations by regularizing the curvature of manifolds generated by variational autoencoders, an approach we call ``$\Gamma$-VAE''. We demonstrate its utility using two example data sets: bulk RNA-seq from The Cancer Genome Atlas (TCGA) and the Genotype-Tissue Expression (GTEx) project; and single-cell RNA-seq from a lineage tracing experiment in hematopoietic stem cell differentiation. We find that the resulting regularized manifolds identify mesoscale structure associated with different cancer cell types, and accurately re-embed tissues from completely unseen, out-of-distribution cancers as if the model had originally been trained on them. Finally, we show that preserving long-range relationships to differentiated cells separates undifferentiated cells, which have not yet specialized, according to their eventual fate. Broadly, we anticipate that regularizing the curvature of generative models will enable more consistent, predictive, and generalizable models in any high-dimensional system with emergent low-dimensional behavior.
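To make the idea of penalizing decoder curvature concrete, the following is a minimal sketch and not the $\Gamma$-VAE regularizer itself, whose exact form is not given in the text above. It assumes a simple proxy: the mean squared second derivative of a decoder along each latent axis, estimated by central finite differences. The names `second_difference_penalty`, `linear_decoder`, `curved_decoder`, and the step size `eps` are hypothetical and chosen only for illustration.

```python
import numpy as np


def second_difference_penalty(decode, z, eps=1e-2):
    """Curvature proxy (an assumption, not the paper's regularizer).

    Estimates the second derivative of `decode` along each latent axis with
    central finite differences,
        (f(z + eps*e_k) - 2 f(z) + f(z - eps*e_k)) / eps**2,
    and returns the mean squared magnitude over samples and axes.
    The value is (numerically) zero for an affine decoder.
    """
    z = np.asarray(z, dtype=float)
    base = decode(z)
    n_latent = z.shape[1]
    total = 0.0
    for k in range(n_latent):
        step = np.zeros(n_latent)
        step[k] = eps
        second = (decode(z + step) - 2.0 * base + decode(z - step)) / eps**2
        total += np.mean(np.sum(second**2, axis=1))
    return total / n_latent


# Toy decoders from a 2-D latent space to a 5-D data space (illustrative only).
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 5))
z = rng.normal(size=(64, 2))


def linear_decoder(latent):
    return latent @ W


def curved_decoder(latent):
    return np.tanh(latent @ W)


flat = second_difference_penalty(linear_decoder, z)
bent = second_difference_penalty(curved_decoder, z)
print(f"linear decoder penalty: {flat:.3e}")
print(f"curved decoder penalty: {bent:.3e}")
```

In a training loop, a term of this kind would typically be added to the usual variational autoencoder objective with a tunable weight, so that larger weights favor flatter manifolds. Note that this proxy only probes axis-aligned second derivatives and is not invariant to reparameterization of the latent space.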