Multimodal large language models (MLLMs) achieve strong results on visual reasoning benchmarks, but answer accuracy alone does not show whether a model relied on the correct visual evidence. This gap matters most in multi-view driving scenes for autonomous driving, where a model can produce a plausible answer while grounding it in the wrong camera view. We introduce a multi-view visual question answering benchmark for evaluating evidence-source identification: given six synchronized nuScenes camera views and a question, the model must identify the supporting camera view and answer the question. The benchmark contains 122 conflict-centric question-answer pairs drawn from 73 scenes, spanning causality, counterfactual reasoning, and intent prediction. View labels are proposed by an automatic conflict-mining pipeline and then manually verified by annotators. We evaluate three settings: camera-view selection, oracle QA in which the model is given the gold view, and joint prediction in which the model selects a view and answers in one pass. Answers are evaluated in both multiple-choice and free-form formats, using exact match for structured predictions and an LLM judge for free-form responses. By explicitly separating visual-source identification from answer correctness, the benchmark exposes grounding failures that answer-only evaluation misses.
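As a rough illustration of how the joint setting can separate view selection from answer correctness, the following minimal sketch scores multiple-choice predictions by exact match. It is not the benchmark's released code: the `Prediction` and `Gold` records, the `score_joint` function, the metric names, and the case-insensitive normalization are all assumptions made for this example. The camera names follow the nuScenes channel naming (e.g. `CAM_FRONT`).

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    view: str    # predicted camera view, e.g. "CAM_FRONT"
    answer: str  # predicted multiple-choice option, e.g. "B"


@dataclass
class Gold:
    view: str    # annotated supporting camera view
    answer: str  # annotated correct option


def normalize(text: str) -> str:
    """Assumed normalization: trim whitespace and ignore case."""
    return text.strip().upper()


def score_joint(preds: list[Prediction], golds: list[Gold]) -> dict[str, float]:
    """Exact-match scoring that reports view and answer correctness separately."""
    if not golds or len(preds) != len(golds):
        raise ValueError("preds and golds must be non-empty and of equal length")

    view_hits = answer_hits = both_hits = answer_only_hits = 0
    for pred, gold in zip(preds, golds):
        view_ok = normalize(pred.view) == normalize(gold.view)
        answer_ok = normalize(pred.answer) == normalize(gold.answer)
        view_hits += view_ok
        answer_hits += answer_ok
        both_hits += view_ok and answer_ok
        # Correct answer grounded in the wrong camera view: the failure mode
        # that answer-only evaluation cannot see.
        answer_only_hits += answer_ok and not view_ok

    n = len(golds)
    return {
        "view_accuracy": view_hits / n,
        "answer_accuracy": answer_hits / n,
        "joint_accuracy": both_hits / n,
        "right_answer_wrong_view": answer_only_hits / n,
    }


# Toy data for illustration only; these are not benchmark results.
golds = [
    Gold("CAM_FRONT", "A"),
    Gold("CAM_FRONT_LEFT", "B"),
    Gold("CAM_BACK_LEFT", "D"),
    Gold("CAM_BACK_RIGHT", "B"),
]
preds = [
    Prediction("CAM_FRONT", "A"),        # view and answer both correct
    Prediction("CAM_BACK", "B"),         # correct answer, wrong view
    Prediction("CAM_BACK_LEFT", "C"),    # correct view, wrong answer
    Prediction("CAM_FRONT_RIGHT", "a"),  # both wrong
]
metrics = score_joint(preds, golds)
print(metrics)
```

Reporting `right_answer_wrong_view` next to `answer_accuracy` is one way to make visible the gap that the abstract describes; free-form responses would need an LLM judge rather than exact match and are outside this sketch.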