We present a unified framework for studying the identifiability of representations learned from simultaneously observed views, such as different data modalities. We allow a partially observed setting in which each view constitutes a nonlinear mixture of a subset of underlying latent variables, which can be causally related. We prove that the information shared across all subsets of any number of views can be learned up to a smooth bijection using contrastive learning and a single encoder per view. We also provide graphical criteria indicating which latent variables can be identified through a simple set of rules, which we refer to as identifiability algebra. Our general framework and theoretical results unify and extend several previous works on multi-view nonlinear ICA, disentanglement, and causal representation learning. We experimentally validate our claims on numerical, image, and multi-modal data sets. Further, we demonstrate that the performance of prior methods is recovered in different special cases of our setup. Overall, we find that access to multiple partial views enables us to identify a more fine-grained representation, under the generally milder assumption of partial observability.
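To make the notion of "information shared across all subsets of views" concrete, the following minimal sketch enumerates every subset of at least two views and computes the latent indices they have in common. It assumes that each view is described by the set of latent indices it depends on. The function name `shared_content`, the view names, and the index assignment are hypothetical and chosen only for illustration; the sketch shows the bookkeeping of shared blocks, not the contrastive training procedure or the identifiability algebra itself.

```python
from itertools import combinations


def shared_content(view_latents):
    """Map every subset of >= 2 views to the latent indices all of them depend on.

    view_latents: dict from view name to the set of latent indices mixed into
    that view (a hypothetical description of the partially observed setting).
    """
    views = sorted(view_latents)
    shared = {}
    for size in range(2, len(views) + 1):
        for subset in combinations(views, size):
            # Shared content of a subset = intersection of its views' latent sets.
            sets = [frozenset(view_latents[v]) for v in subset]
            shared[subset] = frozenset.intersection(*sets)
    return shared


# Hypothetical example: three views, each observing a subset of five latents.
view_latents = {
    "x1": {1, 2, 3},
    "x2": {1, 2, 4},
    "x3": {1, 3, 5},
}
blocks = shared_content(view_latents)

for subset, block in blocks.items():
    print(subset, sorted(block))
```

In this toy assignment, latent 1 is shared by all three views, while latents 2 and 3 are each shared by only one pair of views; latents 4 and 5 appear in a single view and therefore belong to no shared block.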