Benchmarking the 3D spatial understanding of foundation models is essential for real-world applications such as robotics and autonomous driving. Existing evaluations often rely on downstream finetuning with linear heads or task-specific decoders, making it difficult to isolate the intrinsic 3D reasoning ability of pretrained encoders. In this work, we introduce a novel benchmark for in-context 3D scene understanding that requires no finetuning and directly probes the quality of dense visual features. Building on the Hummingbird framework, which evaluates in-context 2D scene understanding, we extend the setup to 3D using the Multi-View ImageNet (MVImgNet) dataset. Given a set of object images captured from specific viewing angles (keys), we benchmark segmentation performance on novel views (queries) and report scores across four difficulty categories (easy, medium, hard, and extreme) determined by the key-query viewpoint difference. We benchmark 8 state-of-the-art foundation models and show that DINO-based encoders remain competitive across large viewpoint shifts, while 3D-aware models such as VGGT require dedicated multi-view adjustments. Our code is publicly available at https://github.com/ToyeshC/open-hummingbird-3d-eval.
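To make the evaluation protocol concrete, the sketch below illustrates Hummingbird-style in-context segmentation: dense patch features from the key views are stored together with their downsampled labels, and each query patch is labeled by soft voting over its nearest key patches. This is a minimal illustration, not the benchmark's actual implementation; the function names (`build_memory_bank`, `segment_query`), the `encoder` interface returning per-patch features, and the hyperparameters (`patch_size=14`, `k=30`) are assumptions for the sake of the example.

```python
import torch
import torch.nn.functional as F

def build_memory_bank(encoder, key_images, key_masks, patch_size=14):
    """Encode key views into dense patch features and pair them with
    segmentation labels downsampled to the patch grid."""
    feats, labels = [], []
    with torch.no_grad():
        for img, mask in zip(key_images, key_masks):
            f = encoder(img.unsqueeze(0))                      # (1, N_patches, D)
            h, w = img.shape[-2] // patch_size, img.shape[-1] // patch_size
            m = F.interpolate(mask[None, None].float(), size=(h, w), mode="nearest")
            feats.append(F.normalize(f.squeeze(0), dim=-1))
            labels.append(m.flatten().long())
    return torch.cat(feats), torch.cat(labels)

def segment_query(encoder, query_image, bank_feats, bank_labels,
                  k=30, num_classes=2, patch_size=14):
    """Label each query patch by softmax-weighted voting over its k most
    similar key patches (cross-attention-style label transfer)."""
    with torch.no_grad():
        q = F.normalize(encoder(query_image.unsqueeze(0)).squeeze(0), dim=-1)  # (N_q, D)
        sim = q @ bank_feats.T                                  # cosine similarities
        topv, topi = sim.topk(k, dim=-1)
        weights = topv.softmax(dim=-1)                          # (N_q, k)
        onehot = F.one_hot(bank_labels[topi], num_classes).float()  # (N_q, k, C)
        probs = (weights.unsqueeze(-1) * onehot).sum(dim=1)     # (N_q, C)
    h = query_image.shape[-2] // patch_size
    w = query_image.shape[-1] // patch_size
    return probs.argmax(dim=-1).reshape(h, w)                   # patch-level prediction
```

Patch-level predictions would then be upsampled to image resolution and scored (e.g. by mIoU) per difficulty category; that step is omitted here for brevity.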