Does Object Binding Naturally Emerge in Large Pretrained Vision Transformers?

Object binding, the brain's ability to bind the many features that collectively represent an object into a coherent whole, is central to human cognition. It groups low-level perceptual features into high-level object representations, stores those objects efficiently and compositionally in memory, and supports human reasoning about individual object instances. While prior work often imposes object-centric attention (e.g., Slot Attention) explicitly to probe these benefits, it remains unclear whether this ability naturally emerges in pretrained Vision Transformers (ViTs). Intuitively, they could: recognizing which patches belong to the same object should be useful for downstream prediction and thus guide attention. Motivated by the quadratic nature of self-attention, we hypothesize that ViTs represent whether two patches belong to the same object, a property we term IsSameObject. We decode IsSameObject from patch embeddings across ViT layers using a quadratic similarity probe, which reaches over 90% accuracy. Crucially, this object-binding capability emerges reliably in DINO, CLIP, and ImageNet-supervised ViTs, but is markedly weaker in MAE, suggesting that binding is not a trivial architectural artifact, but an ability acquired through specific pretraining objectives. We further discover that IsSameObject is encoded in a low-dimensional subspace on top of object features, and that this signal actively guides attention. Ablating IsSameObject from model activations degrades downstream performance and works against the learning objective, implying that emergent object binding naturally serves the pretraining objective. Our findings challenge the view that ViTs lack object binding and highlight how symbolic knowledge of "which parts belong together" emerges naturally in a connectionist system.
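
To make the idea of a quadratic similarity probe concrete, the sketch below scores every pair of patch embeddings with a bilinear form, s_ij = x_i^T W x_j + b, and fits W and b by logistic regression on same-object labels. This is only an illustration under stated assumptions: the abstract does not give the probe's exact parameterization, so the bilinear form, the synthetic two-object embeddings, and all function names (`make_patches`, `quadratic_probe_logits`, `fit_probe`, `pair_accuracy`) are hypothetical stand-ins rather than the authors' implementation.

```python
import numpy as np

def make_patches(rng, n_per_object=20, dim=8, noise=0.3):
    """Synthetic stand-in for patch embeddings (assumption): two objects,
    each a noisy cloud around its own mean vector."""
    mean = np.zeros(dim)
    mean[0] = 2.0
    object_ids = np.repeat([0, 1], n_per_object)
    signs = np.where(object_ids == 0, 1.0, -1.0)[:, None]
    patches = signs * mean + noise * rng.standard_normal((2 * n_per_object, dim))
    return patches, object_ids

def same_object_targets(object_ids):
    """Binary matrix whose entry (i, j) is 1 when patches i and j share an object."""
    return (object_ids[:, None] == object_ids[None, :]).astype(np.float64)

def quadratic_probe_logits(patches, W, bias=0.0):
    """Pairwise logits s_ij = x_i^T W x_j + bias for all patch pairs."""
    return patches @ W @ patches.T + bias

def fit_probe(patches, object_ids, steps=200, lr=0.5):
    """Fit W and bias by gradient descent on the mean binary cross-entropy."""
    n, dim = patches.shape
    W = np.zeros((dim, dim))
    bias = 0.0
    targets = same_object_targets(object_ids)
    for _ in range(steps):
        logits = np.clip(quadratic_probe_logits(patches, W, bias), -30.0, 30.0)
        probs = 1.0 / (1.0 + np.exp(-logits))
        err = (probs - targets) / (n * n)      # dLoss/dlogits
        W -= lr * (patches.T @ err @ patches)  # dLoss/dW = X^T err X
        bias -= lr * err.sum()
    return W, bias

def pair_accuracy(patches, object_ids, W, bias):
    """Fraction of patch pairs whose same-object label is predicted correctly."""
    predicted = quadratic_probe_logits(patches, W, bias) > 0
    return float((predicted == (same_object_targets(object_ids) > 0.5)).mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train_x, train_ids = make_patches(rng)
    test_x, test_ids = make_patches(rng)   # held-out synthetic "image"
    W, bias = fit_probe(train_x, train_ids)
    print("held-out pair accuracy:", pair_accuracy(test_x, test_ids, W, bias))
```

In a real experiment the synthetic patches would be replaced by embeddings taken from a chosen ViT layer, with same-object labels derived from segmentation masks. The synthetic data here is trivially separable, so its accuracy says nothing about the reported results.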
