Causal representation learning aims to infer the high-level latent causal concepts that give rise to observed low-level measurements. This is particularly relevant for heterogeneous data from different environments or domains since distribution shifts often arise through sparse, localized changes in some of the underlying causal mechanisms, while other parts of the generative process remain unchanged. Whereas identifiability of causal representations has been studied extensively, practical uncertainty-aware methods and real-world use cases remain less explored. In this work, we propose a Bayesian approach to learning causal representations from multi-environment data, focusing on the case of discrete causal concepts and unknown multi-node soft interventions. To this end, we translate causal assumptions and interpretability desiderata into suitable priors and parametric choices within a hierarchical model. We then devise an inference scheme based on sequential Monte Carlo sampling to approximate the resulting multimodal posterior. We showcase our approach through case studies on social survey data, where latent causal concepts correspond to cultural values or political opinions, measurements to survey responses, and environments to different countries or states. Our model infers meaningful high-level concepts and plausible causal relations among them, demonstrating its utility for learning causal representations of complex real-world data.