We present a unified framework for studying the identifiability of representations learned from simultaneously observed views, such as different data modalities. We allow a partially observed setting in which each view constitutes a nonlinear mixture of a subset of underlying latent variables, which can be causally related. We prove that the information shared across all subsets of any number of views can be learned up to a smooth bijection using contrastive learning and a single encoder per view. We also provide graphical criteria indicating which latent variables can be identified through a simple set of rules, which we refer to as identifiability algebra. Our general framework and theoretical results unify and extend several previous works on multi-view nonlinear ICA, disentanglement, and causal representation learning. We experimentally validate our claims on numerical, image, and multi-modal data sets. Further, we demonstrate that the performance of prior methods is recovered in different special cases of our setup. Overall, we find that access to multiple partial views enables us to identify a more fine-grained representation, under the generally milder assumption of partial observability.
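As a rough illustration of the partially observed setting, the sketch below enumerates which latent variables are shared by each subset of views. It assumes that the "information shared" by a subset of views corresponds to the intersection of the latent index sets on which those views depend; this reading, along with the view names, the latent indices, and the helper `shared_latents`, is hypothetical and not taken from the paper.

```python
from itertools import combinations

# Hypothetical example: the latent indices each view is a mixture of.
VIEW_LATENTS = {
    "image": {0, 1, 2},
    "text": {1, 2, 3},
    "audio": {2, 4},
}


def shared_latents(view_latents):
    """Map every subset of two or more views to the latents common to all of them."""
    names = sorted(view_latents)
    shared = {}
    for size in range(2, len(names) + 1):
        for subset in combinations(names, size):
            # Assumption: shared information = intersection of latent index sets.
            shared[subset] = set.intersection(*(view_latents[v] for v in subset))
    return shared


if __name__ == "__main__":
    for subset, latents in shared_latents(VIEW_LATENTS).items():
        print(subset, sorted(latents))
```

In this toy configuration, the pair of image and text views shares more latents than the full set of three views does, which is one way to picture why considering subsets of views could yield a more fine-grained representation than considering all views jointly.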