Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving

Multimodal large language models (MLLMs) achieve strong results on visual reasoning benchmarks, but answer accuracy alone does not indicate whether a model relied on the correct visual evidence. This gap matters particularly in multi-view autonomous driving scenes, where a model can produce a plausible answer while grounding it in the wrong camera view. We introduce a multi-view visual question answering benchmark for evaluating evidence-source identification: given six synchronized nuScenes views and a question, the model must identify the supporting camera view and answer the question. The benchmark contains 122 conflict-centric question-answer pairs from 73 scenes, spanning causality, counterfactual reasoning, and intent prediction. View labels are proposed by an automatic conflict-mining pipeline and manually verified by annotators. We evaluate three settings: camera-view selection, oracle QA given the golden (ground-truth) view, and joint prediction in which the model selects a view and answers in one pass. Answers are evaluated in both multiple-choice and free-form formats, using exact match for structured predictions and an LLM judge for free-form responses. By explicitly separating visual-source identification from answer correctness, the benchmark exposes grounding failures that answer-only evaluation misses.
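To make the separation between view identification and answer correctness concrete, the following is a minimal sketch of exact-match scoring for the joint multiple-choice setting. It is not the benchmark's released code: the `Record` fields, the `score` function, and the metric names are hypothetical, and the camera labels are assumed to follow the standard nuScenes channel names, which the benchmark may or may not use. Free-form answers, which the abstract says are scored by an LLM judge, are out of scope here.

```python
from dataclasses import dataclass

# Standard nuScenes camera channels. Assumption: the benchmark's view labels
# use these strings; its actual label format is not given in the abstract.
CAMERAS = (
    "CAM_FRONT", "CAM_FRONT_LEFT", "CAM_FRONT_RIGHT",
    "CAM_BACK", "CAM_BACK_LEFT", "CAM_BACK_RIGHT",
)


@dataclass(frozen=True)
class Record:
    """One joint prediction (hypothetical schema, for illustration only)."""
    gold_view: str    # annotated supporting camera view
    pred_view: str    # view selected by the model
    gold_choice: str  # correct multiple-choice option, e.g. "B"
    pred_choice: str  # option selected by the model


def _norm(text: str) -> str:
    # Exact match after trimming whitespace and ignoring case.
    return text.strip().upper()


def score(records: list[Record]) -> dict[str, float]:
    """Report view and answer accuracy separately, plus their overlap."""
    if not records:
        raise ValueError("no records to score")
    n = len(records)
    view_ok = [_norm(r.pred_view) == _norm(r.gold_view) for r in records]
    ans_ok = [_norm(r.pred_choice) == _norm(r.gold_choice) for r in records]
    return {
        "view_acc": sum(view_ok) / n,
        "answer_acc": sum(ans_ok) / n,
        # Both the view and the answer are correct.
        "joint_acc": sum(v and a for v, a in zip(view_ok, ans_ok)) / n,
        # Correct answer attributed to the wrong camera: the failure mode
        # that answer-only evaluation cannot see.
        "answer_right_view_wrong": sum(
            a and not v for v, a in zip(view_ok, ans_ok)
        ) / n,
    }
```

Reporting `answer_right_view_wrong` alongside plain answer accuracy is one way to surface the cases the abstract describes, where a plausible answer is grounded in the wrong camera view.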

