Causal generative modeling is essential for developing reliable and transparent AI systems capable of counterfactual reasoning. While existing approaches focus on integrating causal constraints during the training of generative models, they often lack a unified framework to leverage the zero-shot reasoning capabilities of pretrained foundation models. We introduce FM-CGM, a modular framework for end-to-end visual causal reasoning using pretrained foundation models. FM-CGM formalizes the causal pipeline through three core components: a concept extractor, a concept manipulator, and a counterfactual generator. By leveraging a large reasoning model for causal inference and a text-to-image diffusion model for generation, our approach enables zero-shot causal discovery, intervention, and counterfactual generation. We then develop Causal Semantic Guidance (CSG), a cross-attention-based mechanism that ensures semantic interventions propagate to descendant concepts while preserving invariant regions. We empirically show that our approach can identify plausible causal structures and is suitable for faithful counterfactual image generation.
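To make the intervention step concrete, the following is a minimal sketch of how a change to one concept could propagate to its descendant concepts while leaving all other concepts invariant. It is not the FM-CGM or CSG implementation: the graph, the concept names, the hand-written `mechanisms` (which stand in for the large reasoning model), and the functions `descendants`, `topological_order`, and `intervene` are all hypothetical, and image generation is omitted entirely.

```python
from collections import deque
from graphlib import TopologicalSorter  # Python 3.9+
from typing import Callable, Dict, List, Set

# Hypothetical toy causal graph over visual concepts (cause -> list of effects).
CausalGraph = Dict[str, List[str]]
Concepts = Dict[str, str]


def descendants(graph: CausalGraph, node: str) -> Set[str]:
    """Return every concept reachable from `node` along cause -> effect edges."""
    seen: Set[str] = set()
    queue = deque(graph.get(node, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(graph.get(current, []))
    return seen


def topological_order(graph: CausalGraph) -> List[str]:
    """Order concepts so that every cause comes before its effects."""
    # TopologicalSorter expects node -> predecessors, so invert the edges.
    preds: Dict[str, Set[str]] = {node: set() for node in graph}
    for cause, effects in graph.items():
        for effect in effects:
            preds.setdefault(effect, set()).add(cause)
    return list(TopologicalSorter(preds).static_order())


def intervene(
    concepts: Concepts,
    graph: CausalGraph,
    target: str,
    value: str,
    mechanisms: Dict[str, Callable[[Concepts], str]],
) -> Concepts:
    """Set `target` to `value` and recompute only its descendants."""
    updated = dict(concepts)  # leave the factual concepts untouched
    updated[target] = value
    affected = descendants(graph, target)
    for node in topological_order(graph):
        if node in affected and node in mechanisms:
            updated[node] = mechanisms[node](updated)
    return updated


# Assumed example scene; in the paper's setting these would come from a concept extractor.
graph: CausalGraph = {
    "weather": ["ground", "sky"],
    "ground": ["footprints"],
    "sky": [],
    "footprints": [],
    "building": [],
}
concepts: Concepts = {
    "weather": "sunny",
    "ground": "dry",
    "sky": "clear",
    "footprints": "none",
    "building": "brick house",
}
# Hand-written stand-ins for the causal reasoning a foundation model would perform.
mechanisms: Dict[str, Callable[[Concepts], str]] = {
    "ground": lambda c: "snow-covered" if c["weather"] == "snowy" else "dry",
    "sky": lambda c: "overcast" if c["weather"] == "snowy" else "clear",
    "footprints": lambda c: "visible" if c["ground"] == "snow-covered" else "none",
}

counterfactual = intervene(concepts, graph, "weather", "snowy", mechanisms)
```

In this toy setup, intervening on `weather` updates `ground`, `sky`, and `footprints` because they are descendants, while `building` keeps its original value. A counterfactual generator would then render the updated concepts as an image.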