Multimodal in-context learning (ICL) has emerged as a key mechanism for harnessing the capabilities of large vision-language models (LVLMs). However, its effectiveness remains highly sensitive to the quality of input ICL sequences, particularly for tasks involving complex reasoning or open-ended generation. A major obstacle is our limited understanding of how LVLMs actually exploit these sequences during inference. To bridge this gap, we systematically interpret multimodal ICL through the lens of task mapping, which reveals how local and global relationships within and among demonstrations guide model reasoning. Building on this insight, we present TACO, a lightweight transformer-based model equipped with task-aware attention that dynamically configures ICL sequences. By injecting task-mapping signals into the autoregressive decoding process, TACO creates a bidirectional synergy between sequence construction and task reasoning. Experiments on five LVLMs and nine datasets demonstrate that TACO consistently surpasses baselines across diverse ICL tasks. These results position task mapping as a novel and valuable perspective for interpreting and improving multimodal ICL.
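To make the idea concrete, the following is a minimal, hypothetical sketch of how a lightweight transformer could autoregressively configure an ICL sequence while conditioning its attention on a task-level signal. All names here (`TaskAwareSelector`, `task_emb`, `build_sequence`) and the greedy selection loop are illustrative assumptions for exposition, not the paper's actual TACO implementation.

```python
# Minimal sketch (assumptions): a small transformer scores candidate demonstrations
# for an ICL sequence, with a learned "task token" standing in for the task-mapping
# signal described in the abstract. Not the authors' code.
import torch
import torch.nn as nn


class TaskAwareSelector(nn.Module):
    """Autoregressively builds an ICL sequence by scoring candidate demonstrations.

    A learned task embedding is prepended to the decoder input so self-attention can
    mix per-demonstration (local) features with a task-level (global) signal.
    """

    def __init__(self, d_model: int = 256, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.task_emb = nn.Parameter(torch.randn(1, 1, d_model))  # global task token
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.scorer = nn.Linear(d_model, d_model)

    def forward(self, query: torch.Tensor, chosen: torch.Tensor,
                candidates: torch.Tensor) -> torch.Tensor:
        """Scores each candidate given the query and demonstrations chosen so far.

        query:      (B, D)     embedding of the test query
        chosen:     (B, T, D)  embeddings of already-selected demonstrations
        candidates: (B, N, D)  embeddings of remaining candidate demonstrations
        returns:    (B, N)     selection logits over candidates
        """
        B = query.size(0)
        task_tok = self.task_emb.expand(B, -1, -1)               # task-mapping signal
        ctx = torch.cat([task_tok, query.unsqueeze(1), chosen], dim=1)
        h = self.encoder(ctx)                                     # task-aware context
        state = self.scorer(h[:, 0])                              # read out task token
        return torch.einsum("bd,bnd->bn", state, candidates)      # candidate logits


def build_sequence(model: TaskAwareSelector, query: torch.Tensor,
                   candidates: torch.Tensor, k: int = 4) -> torch.Tensor:
    """Greedy autoregressive selection of k demonstrations (training not shown)."""
    B, N, D = candidates.shape
    device = candidates.device
    chosen = candidates.new_zeros(B, 0, D)
    mask = torch.zeros(B, N, dtype=torch.bool, device=device)
    order = []
    for _ in range(k):
        logits = model(query, chosen, candidates)
        logits = logits.masked_fill(mask, float("-inf"))          # exclude picked items
        idx = logits.argmax(dim=-1)                               # (B,)
        order.append(idx)
        mask[torch.arange(B, device=device), idx] = True
        picked = candidates[torch.arange(B, device=device), idx].unsqueeze(1)
        chosen = torch.cat([chosen, picked], dim=1)               # grow the sequence
    return torch.stack(order, dim=1)                              # (B, k) indices
```

In this sketch, reading the selection state from the task token is one simple way to let a global task signal steer each autoregressive selection step; the paper's task-aware attention and its coupling with LVLM reasoning are more involved than shown here.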