Visual causal reasoning is essential for understanding and intervening in the physical world: it requires identifying causal variables in visual inputs and reasoning about the effects of interventions. Despite recent progress, large vision--language models (VLMs) remain brittle on such tasks, especially on interventional and counterfactual queries over multi-image inputs. Most existing approaches inject causal knowledge through textual prompts, which leaves the causal mechanism external to model execution and limits reliable control at inference time. To address this problem, we propose BridgeVLM, which internalizes visual causal reasoning: it induces a causal graph from multi-image inputs and converts it into structured Causal Tokens, which are executed by RAMP layers injected into the LLM decoder to perform causal message passing. We further introduce M3S, a unified training interface that provides fine-grained causal supervision at both local and global granularities. On CausalVLBench, BridgeVLM achieves 54.4% accuracy on intervention tasks (vs. 33.2% with prompt-level supervision) and substantially improves causal structure learning ($F_1$: 33.4% $\rightarrow$ 75.1%); on Causal3D, it improves results from 43.6% to 49.0%.
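For readers unfamiliar with how a structure-learning $F_1$ such as the one reported above can be computed, the following is a minimal sketch. It assumes that the score is taken over directed edges of the predicted and reference causal graphs; the abstract does not state the benchmark's exact scoring protocol, so this is an illustration rather than the CausalVLBench implementation. The function name `edge_f1` and the variable names `A`, `B`, `C` are hypothetical.

```python
def edge_f1(predicted, reference):
    """F1 over directed edges; each edge is a (cause, effect) tuple.

    Assumption: edges are compared exactly, so a reversed edge counts
    as both a false positive and a false negative.
    """
    predicted, reference = set(predicted), set(reference)
    true_pos = len(predicted & reference)
    if true_pos == 0:
        # Covers empty predictions and predictions with no correct edge.
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical three-variable example: one edge is predicted reversed.
reference = {("A", "B"), ("B", "C"), ("A", "C")}
predicted = {("A", "B"), ("B", "C"), ("C", "A")}
score = edge_f1(predicted, reference)
```

In this example two of the three predicted edges match the reference, so precision and recall are both 2/3 and the resulting $F_1$ is 2/3.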