Compositional generalization, the ability to reason about novel combinations of familiar concepts, is fundamental to human cognition and a critical challenge for machine learning. Object-centric (OC) representations, which encode a scene as a set of objects, are often argued to support such generalization, but systematic evidence in visually rich settings is limited. We introduce a Visual Question Answering benchmark across three controlled visual worlds (CLEVRTex, Super-CLEVR, and MOVi-C) to measure how well vision encoders, with and without object-centric inductive biases, generalize to unseen combinations of object properties. To ensure a fair and comprehensive comparison, we carefully control for training data diversity, sample size, representation size, downstream model capacity, and compute. As foundation models we use DINOv2 and SigLIP2, two widely used vision encoders, together with their OC counterparts. Our key findings are that (1) OC approaches are superior in harder compositional generalization settings; (2) the original dense representations surpass OC only in easier settings and typically require substantially more downstream compute; and (3) OC models are more sample efficient, achieving stronger generalization from fewer images, whereas dense encoders catch up or pull ahead only given sufficient data and diversity. Overall, object-centric representations offer stronger compositional generalization whenever dataset size, training data diversity, or downstream compute is constrained.