AIM-CoT: Active Information-driven Multimodal Chain-of-Thought for Vision-Language Reasoning

Interleaved-Modal Chain-of-Thought (I-MCoT) advances vision-language reasoning tasks such as Visual Question Answering (VQA). This paradigm inserts selected visual evidence from the input image into the context of a Vision-Language Model (VLM), enabling it to ground its reasoning in these details. The efficacy of an I-MCoT framework therefore depends on deciding what to see (evidence selection) and when to see it (insertion triggering). Existing methods fall short in both respects. First, for selection, they rely on attention signals, which are unreliable, particularly under severe granularity imbalance between the brief textual query and the information-rich image. Second, for triggering, they adopt static triggers, which fail to capture the VLM's dynamic need for visual evidence. To address these limitations, we propose a novel I-MCoT framework, Active Information-driven Multimodal Chain-of-Thought (AIM-CoT), which improves both evidence selection and insertion triggering through three components: (1) Context-enhanced Attention-map Generation (CAG), which mitigates granularity imbalance by enhancing the textual context; (2) Active Visual Probing (AVP), which proactively selects the most informative evidence via an information foraging process; and (3) Dynamic Attention-shift Trigger (DAT), which activates insertions precisely when the VLM's attention shifts from the text to the visual context. Experiments across three benchmarks and four backbones demonstrate AIM-CoT's consistent superiority. Our code is available at https://anonymous.4open.science/r/AIMCoT.
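
To make the idea of triggering on an attention shift more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify how DAT measures a shift, so this sketch assumes one simple reading: at each decoding step, compute the share of attention mass that falls on visual tokens, and fire a trigger when that share rises by at least a threshold relative to the previous step. The function names, the `min_shift` parameter, and its default value are all hypothetical.

```python
from typing import List, Sequence


def visual_attention_share(attn: Sequence[float], is_visual: Sequence[bool]) -> float:
    """Fraction of one decoding step's attention mass that lands on visual tokens."""
    total = sum(attn)
    if total <= 0:
        return 0.0
    visual = sum(a for a, v in zip(attn, is_visual) if v)
    return visual / total


def attention_shift_triggers(
    steps: Sequence[Sequence[float]],
    is_visual: Sequence[bool],
    min_shift: float = 0.2,  # hypothetical threshold, not taken from the paper
) -> List[int]:
    """Return decoding-step indices where the visual attention share
    increases by at least `min_shift` compared with the previous step."""
    triggers: List[int] = []
    prev = None
    for t, attn in enumerate(steps):
        share = visual_attention_share(attn, is_visual)
        if prev is not None and share - prev >= min_shift:
            triggers.append(t)
        prev = share
    return triggers


if __name__ == "__main__":
    # Toy context: two visual tokens followed by two text tokens.
    is_visual = [True, True, False, False]
    # Attention rows for four decoding steps; attention moves toward the
    # visual tokens at step 2.
    steps = [
        [0.1, 0.1, 0.4, 0.4],
        [0.1, 0.1, 0.4, 0.4],
        [0.3, 0.3, 0.2, 0.2],
        [0.3, 0.3, 0.2, 0.2],
    ]
    print(attention_shift_triggers(steps, is_visual))
```

A static trigger would insert evidence at fixed positions regardless of these values; the sketch instead reacts only when the model's attention actually moves toward the image, which is the contrast the abstract draws.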
