Lost at the End: Primacy Bias in Multimodal Retrieval-Augmented Question Answering

Knowledge-based visual question answering (KB-VQA) lets vision-language systems answer questions that exceed their parametric knowledge by conditioning a reader on passages retrieved from a Wikipedia-scale knowledge base. In text-only long-context LLMs, the use of retrieved context follows the U-shaped "lost-in-the-middle" effect of Liu et al. (2024): information at the start and end of the context is used, while information in the middle is lost. Whether this pattern transfers to deployed multimodal KB-VQA is an open question. To address it, we design the first controlled probe of reader-side position dependence in multimodal KB-VQA: a gold-position protocol in which, for each question, only the gold passage's prompt slot varies. We run it on three open-source 7B/8B VLM readers and two KB-VQA benchmarks with k up to 20. The shape flips from a U to primacy: gold-at-first beats gold-at-last by 16 to 26 points in every reader-by-benchmark cell, an effect we call "Lost at the End". Three targeted ablations narrow the cause: a text-only control shows that the multimodal setting amplifies an already-present text-mode primacy by a factor of 2.2 to 4.5, and image-position and distractor-shuffle ablations together pin the locus to prompt slot 0 of the instruction-tuned reader. With the reader frozen, three retrieval-side fixes (MMR, oracle reranking, rank-based reordering) all leave the gap intact, with no separable improvement. Our findings indicate that recall@k is the wrong metric for deployed KB-VQA and that closing the gap requires reader-side intervention; we release our protocol as a controlled instrument for evaluating such interventions.
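The gold-position protocol can be illustrated with a minimal sketch. This is not the released implementation: the function names (`place_gold`, `gold_position_variants`, `primacy_gap`) are hypothetical, and the sketch assumes that distractor passages keep a fixed order while the gold passage is moved between slots, and that per-slot accuracy is measured separately by running a reader on each variant.

```python
from typing import Dict, Iterable, List


def place_gold(distractors: List[str], gold: str, slot: int) -> List[str]:
    """Return a context list with `gold` at index `slot`.

    The distractors keep their relative order, so the gold slot is the
    only thing that changes between variants (an assumption of this sketch).
    """
    if not 0 <= slot <= len(distractors):
        raise ValueError(f"slot {slot} out of range for {len(distractors)} distractors")
    return distractors[:slot] + [gold] + distractors[slot:]


def gold_position_variants(
    distractors: List[str], gold: str, slots: Iterable[int]
) -> Dict[int, List[str]]:
    """Build one context per requested gold slot for a single question."""
    return {s: place_gold(distractors, gold, s) for s in slots}


def primacy_gap(acc_by_slot: Dict[int, float]) -> float:
    """Accuracy at the first probed slot minus accuracy at the last, in points.

    `acc_by_slot` maps gold slot -> accuracy in [0, 1], measured elsewhere.
    A positive value means gold-at-first beats gold-at-last.
    """
    first, last = min(acc_by_slot), max(acc_by_slot)
    return 100.0 * (acc_by_slot[first] - acc_by_slot[last])
```

For a question with k passages in total, calling `gold_position_variants(distractors, gold, range(k))` with k - 1 distractors yields k prompts that differ only in where the gold passage sits, so any accuracy difference across them can be attributed to position rather than to retrieval content.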
