Humans develop perception through a bottom-up hierarchy, from basic visual primitives and Gestalt principles to high-level semantics. In contrast, current Multimodal Large Language Models (MLLMs) are trained directly on complex downstream tasks, often bypassing these foundational visual capabilities. To systematically investigate this gap, we introduce VisFactor, a benchmark that digitizes 20 vision-centric subtests from the Factor-Referenced Cognitive Test (FRCT), a well-established cognitive psychology assessment spanning four domains of human visual cognition. We further design algorithms that automatically construct and validate an unlimited number of test cases with controllable difficulty. Using VisFactor, we evaluate 23 frontier MLLMs, including both proprietary (e.g., GPT, Gemini) and open-source (e.g., LLaMA, Qwen) models. The best model scores only 30.17%, and models consistently fail on tasks such as mental rotation, spatial relation inference, and figure-ground discrimination, regardless of model size or prompting strategy. These findings suggest that performance gains on existing general benchmarks may be castles in the air rather than evidence of genuine, human-like visual cognition.
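The abstract only names the construction-and-validation algorithms without describing them; as a minimal sketch of what such a generator could look like, the hypothetical Python below builds mental-rotation items whose difficulty grows with grid size and rejects degenerate (symmetric) stimuli during validation. All identifiers (`make_item`, `rotate90`, `mirror`) are illustrative assumptions, not the paper's actual implementation.

```python
import random


def rotate90(grid):
    """Rotate a square binary grid 90 degrees clockwise."""
    n = len(grid)
    return [[grid[n - 1 - c][r] for c in range(n)] for r in range(n)]


def mirror(grid):
    """Mirror a grid horizontally (left-right flip of each row)."""
    return [row[::-1] for row in grid]


def make_item(size=4, seed=None):
    """Construct one mental-rotation item (hypothetical generator).

    Returns a random binary stimulus grid, its 90-degree rotation as the
    correct answer, and a mirrored-then-rotated foil as the distractor.
    Difficulty is controlled by `size`: larger grids are harder.
    """
    rng = random.Random(seed)
    while True:
        stimulus = [[rng.randint(0, 1) for _ in range(size)]
                    for _ in range(size)]
        target = rotate90(stimulus)
        distractor = rotate90(mirror(stimulus))
        # Automatic validation: discard symmetric stimuli whose mirrored
        # distractor coincides with the rotated target, so every emitted
        # item has exactly one correct option.
        if target != distractor:
            return {"stimulus": stimulus,
                    "target": target,
                    "distractor": distractor}


# Example: a harder 6x6 item, reproducible via the seed.
item = make_item(size=6, seed=0)
```

A mirrored-then-rotated foil is the classic distractor in mental-rotation tests, since no pure in-plane rotation can map it onto the target.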