SOCO: Benchmarking Semantic Object Correspondence in Vision Foundation Models

Measuring structured object understanding in vision foundation models remains challenging because evaluation protocols are inconsistent and part-level supervision is limited. Semantic correspondence (SC) evaluates this capability by testing whether object parts can be matched across instances and categories under large variations in appearance, viewpoint, and geometry. To enable systematic SC evaluation, we introduce SOCO, a new benchmark for Semantic Object Correspondence that defines a taxonomy of correspondence types and provides consistent, functionally meaningful keypoint annotations across 100 categories and over 1M correspondence pairs. In addition, SOCO includes language descriptions of its keypoints, enabling the evaluation of large vision-language models (LVLMs) and their fine-grained part-level understanding. Comprehensive experiments reveal that (i) vision foundation backbones encode strong semantic structure but transfer correspondences poorly across related categories and only partially capture object-part position, (ii) LVLMs are stronger at text-prompted part localization than at visual-reference cross-image matching, exposing a gap between language-grounded localization and fine-grained visual correspondence, and (iii) correspondence performance predicts performance on dense downstream tasks, including segmentation, tracking, 3D pose estimation, and 3D detection, more strongly than ImageNet classification accuracy does. Together, these findings position SOCO as a benchmark for structured, part-level representation quality in vision and multimodal foundation models.
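To make the matching task concrete, the following is a minimal sketch of how keypoint correspondence is commonly scored with frozen backbone features: each source keypoint feature is matched to its nearest target feature by cosine similarity, and the result is scored with the Percentage of Correct Keypoints (PCK). The abstract does not state SOCO's matching procedure or metric, so the use of PCK, the threshold `alpha`, and the helper names `match_keypoint` and `pck` are assumptions made for illustration only, and the feature vectors are toy values rather than benchmark data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def match_keypoint(src_feat, tgt_feats):
    """Return the target location whose feature is most similar to src_feat.

    tgt_feats is a hypothetical dict mapping (x, y) locations to feature vectors.
    """
    return max(tgt_feats, key=lambda xy: cosine(src_feat, tgt_feats[xy]))


def pck(pred, gt, bbox_size, alpha=0.1):
    """Fraction of predictions within alpha * max(bbox width, height) of ground truth.

    The normalization by the larger bounding-box side is one common convention,
    assumed here; it is not taken from the source.
    """
    threshold = alpha * max(bbox_size)
    hits = sum(1 for p, g in zip(pred, gt) if math.dist(p, g) <= threshold)
    return hits / len(gt)


# Toy target image: three locations with 2-D features (illustrative values only).
tgt_feats = {(0, 0): [1.0, 0.0], (10, 0): [0.0, 1.0], (10, 10): [1.0, 1.0]}
# Features sampled at two annotated source keypoints.
src_feats = [[0.9, 0.1], [0.1, 0.9]]
# Ground-truth target locations for those two keypoints.
gt = [(0, 0), (10, 10)]

pred = [match_keypoint(f, tgt_feats) for f in src_feats]
score = pck(pred, gt, bbox_size=(20, 20), alpha=0.1)
print(pred, score)
```

In this toy case the first keypoint is matched correctly and the second lands 10 pixels from its ground-truth location, outside the 2-pixel threshold, so half of the keypoints count as correct.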
