Daily scenarios are characterized by visual richness, requiring Multimodal Large Language Models (MLLMs) to filter noise and identify decisive visual clues for accurate reasoning. Yet, current benchmarks predominantly aim at evaluating MLLMs' pre-existing knowledge or perceptual understanding, often neglecting the critical capability of reasoning. To bridge this gap, we introduce DailyClue, a benchmark designed for visual clue-driven reasoning in daily scenarios. Our construction is guided by two core principles: (1) strict grounding in authentic daily activities, and (2) challenging query design that necessitates more than surface-level perception. Instead of simple recognition, our questions compel MLLMs to actively explore suitable visual clues and leverage them for subsequent reasoning. To this end, we curate a comprehensive dataset spanning four major daily domains and 16 distinct subtasks. Comprehensive evaluation across MLLMs and agentic models underscores the formidable challenge posed by our benchmark. Our analysis reveals several critical insights, emphasizing that the accurate identification of visual clues is essential for robust reasoning.