Human communication often relies on visual cues to resolve ambiguity. While humans can intuitively integrate these cues, AI systems often find it challenging to engage in sophisticated multimodal reasoning. We introduce VAGUE, a benchmark evaluating multimodal AI systems' ability to integrate visual context for intent disambiguation. VAGUE consists of 1.6K ambiguous textual expressions, each paired with an image and multiple-choice interpretations, where the correct answer is only apparent with visual context. The dataset spans both staged, complex scenes (Visual Commonsense Reasoning) and natural, personal scenes (Ego4D), ensuring diversity. Our experiments reveal that existing multimodal AI models struggle to infer the speaker's true intent. While performance consistently improves as more visual cues are introduced, overall accuracy remains far below human performance, highlighting a critical gap in multimodal reasoning. Analysis of failure cases shows that current models fail to distinguish true intent from superficial correlations in the visual scene, indicating that they perceive images but do not effectively reason with them. We release our code and data at https://github.com/Hazel-Heejeong-Nam/VAGUE.git.
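To make the task format concrete, the sketch below shows one plausible way to represent a benchmark item (an ambiguous utterance, an image, and candidate interpretations) and to score multiple-choice predictions. This is a minimal illustration, not the released loader or schema; field names such as utterance, image_path, choices, and answer_idx are assumptions for exposition.

```python
# Hypothetical sketch of a VAGUE-style item and a toy accuracy computation.
# Field names and the example content are illustrative assumptions, not the official schema.
from dataclasses import dataclass
from typing import List


@dataclass
class VagueItem:
    utterance: str      # ambiguous speaker expression
    image_path: str     # visual context (e.g., a VCR scene or an Ego4D frame)
    choices: List[str]  # candidate interpretations of the speaker's intent
    answer_idx: int     # index of the interpretation supported by the image


def accuracy(items: List[VagueItem], predictions: List[int]) -> float:
    """Fraction of items where the predicted choice matches the visually grounded intent."""
    if not items:
        return 0.0
    correct = sum(int(pred == item.answer_idx) for item, pred in zip(items, predictions))
    return correct / len(items)


# Example: a single staged item with four candidate intents.
example = VagueItem(
    utterance="It's getting pretty warm in here, isn't it?",
    image_path="scenes/example_frame.jpg",
    choices=[
        "The speaker wants the window opened.",
        "The speaker is commenting on the weather forecast.",
        "The speaker wants the heater turned up.",
        "The speaker is asking about the room's size.",
    ],
    answer_idx=0,
)
print(accuracy([example], predictions=[0]))  # 1.0
```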