Recent years have seen a surge of interest in learning high-level causal representations from low-level image pairs under interventions. Yet existing efforts are largely limited to simple synthetic settings that are far removed from real-world problems. In this paper, we present Causal Triplet, a causal representation learning benchmark featuring not only visually more complex scenes, but also two crucial desiderata commonly overlooked in prior work: (i) an actionable counterfactual setting, where only certain object-level variables allow for counterfactual observations whereas others do not; and (ii) an interventional downstream task with an emphasis on out-of-distribution robustness, grounded in the independent causal mechanisms principle. Through extensive experiments, we find that models built with the knowledge of disentangled or object-centric representations significantly outperform their distributed counterparts. However, recent causal representation learning methods still struggle to identify such latent structures, indicating substantial challenges and opportunities for future work. Our code and datasets will be available at https://sites.google.com/view/causaltriplet.