Object-centric learning (OCL) aims to learn structured scene representations that support compositional generalization and robustness to out-of-distribution (OOD) data. However, OCL models are rarely evaluated against these goals. Instead, most prior work evaluates OCL models solely through object discovery and simple reasoning tasks, such as probing the representation via image classification. We identify two limitations of existing benchmarks: (1) they offer limited insight into the usefulness of OCL representations, and (2) localization and representation usefulness are assessed with disjoint metrics. To address (1), we use instruction-tuned vision-language models (VLMs) as evaluators, enabling scalable benchmarking across diverse VQA datasets that measures how well VLMs leverage OCL representations for complex reasoning. To address (2), we introduce a unified evaluation task and metric that jointly assess localization (where) and representation usefulness (what), eliminating the inconsistencies introduced by disjoint evaluation. Finally, we include a simple multi-feature reconstruction baseline as a reference point.