Visual abductive reasoning (VAR) is a challenging task that requires AI systems to infer the most likely explanation for incomplete visual observations. While recent MLLMs have developed strong general-purpose multimodal reasoning capabilities, they still fall short of humans in abductive inference. To bridge this gap, we draw inspiration from the interplay between verbal and pictorial abduction in human cognition and propose to strengthen the abductive ability of MLLMs by mimicking this dual-mode behavior. Concretely, we introduce AbductiveMLLM, which comprises two synergistic components: REASONER and IMAGINER. The REASONER operates in the verbal domain: it first explores a broad space of possible explanations using a blind LLM and then prunes visually incongruent hypotheses based on cross-modal causal alignment. The remaining hypotheses are injected into the MLLM as targeted priors, steering its reasoning toward causally coherent explanations. The IMAGINER, in turn, further guides the MLLM by emulating human-like pictorial thinking: it conditions a text-to-image diffusion model on both the input video and the REASONER's output embeddings to "imagine" plausible visual scenes corresponding to the verbal explanations, thereby enriching the MLLM's contextual grounding. The two components are trained jointly in an end-to-end manner. Experiments on standard VAR benchmarks show that AbductiveMLLM achieves state-of-the-art performance, consistently outperforming both traditional solutions and advanced MLLMs.
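To make the REASONER's hypothesize-then-prune step concrete, here is a minimal sketch of how such a loop could be structured. Every component in it (blind_llm_propose, embed_text, embed_video, the keep_k cutoff) is a hypothetical placeholder standing in for the paper's actual models and alignment measure, not the method's implementation.

```python
import numpy as np

def blind_llm_propose(premise: str, n: int = 8) -> list[str]:
    """Hypothetical stand-in for the blind LLM: propose candidate
    explanations from the textual premise alone (no visual input)."""
    return [f"hypothesis {i}: a plausible cause of '{premise}'" for i in range(n)]

def embed_text(text: str, dim: int = 64) -> np.ndarray:
    """Placeholder text encoder: deterministic pseudo-embedding."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def embed_video(frames: list[str], dim: int = 64) -> np.ndarray:
    """Placeholder video encoder: mean of per-frame pseudo-embeddings."""
    v = np.mean([embed_text(f, dim) for f in frames], axis=0)
    return v / np.linalg.norm(v)

def reasoner(premise: str, frames: list[str], keep_k: int = 3) -> list[str]:
    """Explore a broad hypothesis space with the blind LLM, then prune
    hypotheses whose alignment with the observed video is low."""
    candidates = blind_llm_propose(premise)
    video_emb = embed_video(frames)
    scored = [(float(embed_text(h) @ video_emb), h) for h in candidates]
    scored.sort(reverse=True)  # highest cross-modal alignment first
    return [h for _, h in scored[:keep_k]]

if __name__ == "__main__":
    frames = ["frame: broken window", "frame: ball on the floor"]
    priors = reasoner("the window is broken", frames)
    # The surviving hypotheses would be injected into the MLLM prompt as
    # targeted priors; the IMAGINER would additionally condition a diffusion
    # model on the video and these priors to render imagined scenes.
    print(priors)
```

In this sketch the pruning criterion is a simple dot product between placeholder embeddings; the paper's cross-modal causal alignment is a learned, jointly trained score rather than this toy similarity.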