3D visual grounding has made notable progress in localizing objects within complex 3D scenes. However, grounding referring expressions beyond the object level in 3D scenes remains unexplored. In this paper, we introduce Anywhere3D-Bench, a holistic 3D visual grounding benchmark consisting of 2,886 referring expression-3D bounding box pairs spanning four grounding levels: human-activity areas, unoccupied space beyond objects, individual objects in the scene, and fine-grained object parts. We assess a range of state-of-the-art 3D visual grounding methods alongside large language models (LLMs) and multimodal LLMs (MLLMs) on Anywhere3D-Bench. Experimental results reveal that space-level and part-level visual grounding pose the greatest challenges: space-level tasks require more comprehensive spatial reasoning, for example, modeling distances and spatial relations within 3D space, while part-level tasks demand fine-grained perception of object composition. Even the best-performing model, OpenAI o4-mini, achieves only 23.00% accuracy on space-level tasks and 31.46% on part-level tasks, significantly lower than its performance on area-level and object-level tasks. These findings underscore a critical gap in current models' capacity to understand and reason about 3D scenes beyond object-level semantics.