Directed Acyclic Graphs (DAGs) are a standard tool in causal modeling, but their suitability for capturing the complexity of large-scale multimodal data is questionable. In practice, real-world multimodal datasets are often collected from heterogeneous generative processes that do not conform to a single DAG. Instead, they may involve multiple, even opposing, DAG structures with inverse causal directions. To address this gap, we first propose a novel latent partial causal model tailored for multimodal representation learning, featuring two coupled sets of latent variables connected by an undirected edge to represent the transfer of knowledge across modalities. Under specific statistical assumptions, we establish an identifiability result, demonstrating that representations learned by MultiModal Contrastive Learning (MMCL) correspond to the latent coupled variables up to a trivial transformation. This result deepens our understanding of why MMCL works, highlights its potential for representation disentanglement, and expands the utility of pre-trained models like CLIP. Synthetic experiments confirm the robustness of our findings, even when the assumptions are partially violated. Most importantly, experiments show that a pre-trained CLIP model embodies disentangled representations, enabling few-shot learning and improving domain generalization across diverse real-world datasets. Together, these contributions push the boundaries of MMCL, both in theory and in practical applications.
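The abstract does not specify the MMCL objective, but CLIP-style training typically uses a symmetric contrastive (InfoNCE) loss over paired image and text embeddings. The sketch below is illustrative only, assuming NumPy; the function name, temperature value, and batch setup are hypothetical, not taken from the paper.

```python
import numpy as np

def clip_style_loss(z_img, z_txt, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    z_img, z_txt: (n, d) arrays; row i of each is a matched pair.
    """
    # L2-normalize so that dot products are cosine similarities.
    z_img = z_img / np.linalg.norm(z_img, axis=1, keepdims=True)
    z_txt = z_txt / np.linalg.norm(z_txt, axis=1, keepdims=True)
    logits = (z_img @ z_txt.T) / temperature  # (n, n) similarity matrix

    def log_softmax(x, axis):
        x = x - x.max(axis=axis, keepdims=True)  # numerical stability
        return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

    # Cross-entropy with the diagonal (matched pairs) as targets,
    # averaged over the image->text and text->image directions.
    loss_img = -np.diag(log_softmax(logits, axis=1)).mean()
    loss_txt = -np.diag(log_softmax(logits, axis=0)).mean()
    return (loss_img + loss_txt) / 2.0
```

Correctly matched pairs should score a lower loss than mismatched ones, which is the alignment pressure that, under the paper's assumptions, recovers the latent coupled variables up to a trivial transformation.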