Inspired by human categorization, object property reasoning involves identifying and recognizing both low-level details and higher-level abstractions. While current visual question answering (VQA) studies consider multiple object properties, such as size, they typically blend perception and reasoning and lack representative coverage of reasoning types and image categories, making it unclear whether and how vision-language models (VLMs) abstract and reason over depicted objects. To this end, we introduce a systematic evaluation framework comprising images of three representative types, three reasoning levels of increasing complexity, and four object property dimensions, informed by prior work on common sense. We develop a procedure to instantiate this framework in two VQA object reasoning benchmarks: OPTICS-CNT, comprising 360 images paired with 1,080 multi-level, count-based questions, and OPTICS-CMP, comprising 2.1k comparison questions. Experiments with 12 state-of-the-art VLMs in zero-shot settings reveal significant limitations relative to humans, with the best-performing model achieving below 40% accuracy on counting and below 70% on comparison. VLMs struggle particularly with photographic images, counterfactual reasoning, physical and functional properties, and higher counts. We make the OPTICS benchmark data and code available to support future work on scalable benchmarking methods, generalized annotation guidelines, and advanced reasoning VLMs.