Multimodal Large Language Models (MLLMs) have shown remarkable proficiency on general-purpose vision-language benchmarks, reaching or even exceeding human-level performance. However, these evaluations typically rely on standard in-distribution data, leaving the robustness of MLLMs largely unexamined in scenarios that defy common-sense priors. To address this gap, we introduce VIA-Bench, a challenging benchmark designed to probe model performance on visual illusions and anomalies. It spans six core categories: color illusions, motion illusions, gestalt illusions, geometric and spatial illusions, general visual illusions, and visual anomalies. Through careful human-in-the-loop review, we construct over 1K high-quality question-answer pairs that require nuanced visual reasoning. Extensive evaluation of over 20 state-of-the-art MLLMs, spanning proprietary, open-source, and reasoning-enhanced models, uncovers significant vulnerabilities. Notably, we find that Chain-of-Thought (CoT) reasoning offers negligible robustness gains, often yielding ``brittle mirages'' in which a model's reasoning collapses under illusory stimuli. Our findings reveal a fundamental divergence between machine and human perception, suggesting that resolving such perceptual bottlenecks is critical for progress toward artificial general intelligence. The benchmark data and code will be released.