3D Visual Grounding (3DVG) aims to locate objects in 3D scenes based on textual descriptions, which is essential for applications such as augmented reality and robotics. Traditional 3DVG approaches rely on annotated 3D datasets and predefined object categories, limiting scalability and adaptability. To overcome these limitations, we introduce SeeGround, a zero-shot 3DVG framework that leverages 2D Vision-Language Models (VLMs) trained on large-scale 2D data. We represent 3D scenes as a hybrid of query-aligned rendered images and spatially enriched text descriptions, bridging the gap between 3D data and the input formats of 2D VLMs. We further propose two modules: the Perspective Adaptation Module, which dynamically selects viewpoints to render query-relevant images, and the Fusion Alignment Module, which integrates 2D images with 3D spatial descriptions to enhance object localization. Extensive experiments on ScanRefer and Nr3D demonstrate that our approach outperforms existing zero-shot methods by large margins. Notably, it surpasses weakly supervised methods and rivals some fully supervised ones, exceeding the previous state of the art by 7.7% on ScanRefer and 7.1% on Nr3D.
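The hybrid scene representation described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's implementation: `select_viewpoint` stands in for the Perspective Adaptation Module (here it simply places a camera on the scene-facing side of an anchor object), and `build_prompt` stands in for the Fusion Alignment Module's text side (the rendered image would be passed to the VLM alongside this prompt). All function names, the camera geometry, and the prompt format are assumptions for illustration.

```python
import math

def select_viewpoint(anchor_center, scene_center, distance=2.0, height=1.5):
    """Place a virtual camera `distance` metres beyond the anchor object,
    along the ray from the scene centre, looking back at the anchor.
    (Hypothetical geometry; the actual Perspective Adaptation Module
    selects viewpoints dynamically from the parsed query.)"""
    dx = anchor_center[0] - scene_center[0]
    dy = anchor_center[1] - scene_center[1]
    norm = math.hypot(dx, dy) or 1.0  # avoid division by zero
    return (anchor_center[0] + distance * dx / norm,
            anchor_center[1] + distance * dy / norm,
            height)  # camera position; look-at target is anchor_center

def build_prompt(query, objects):
    """Compose a spatially enriched text description of candidate objects.
    The rendered image from the selected viewpoint would be sent to the
    2D VLM together with this text (hypothetical prompt format)."""
    lines = [f"- {name} at (x={c[0]:.2f}, y={c[1]:.2f}, z={c[2]:.2f})"
             for name, c in objects]
    return ("Query: " + query + "\nObjects in the scene:\n"
            + "\n".join(lines)
            + "\nReturn the name of the referred object.")

# Example usage with toy coordinates:
cam = select_viewpoint((1.0, 0.0, 0.5), (0.0, 0.0, 0.0))
prompt = build_prompt("the chair next to the table",
                      [("chair", (1.0, 0.0, 0.5)),
                       ("table", (1.5, 0.3, 0.4))])
```

The key design point the sketch mirrors is that the 3D scene never enters the VLM directly: it is converted into a 2D rendering plus coordinate-annotated text, both of which a stock 2D VLM can consume without any 3D-specific training.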