3D visual grounding (3DVG) aims to localize objects in a 3D scene based on natural language queries. In this work, we explore zero-shot 3DVG from multi-view images alone, without requiring any geometric supervision or object priors. We introduce Z3D, a universal grounding pipeline that flexibly operates on multi-view images while optionally incorporating camera poses and depth maps. We identify key bottlenecks in prior zero-shot methods that cause significant performance degradation, and address them with (i) a state-of-the-art zero-shot 3D instance segmentation method to generate high-quality 3D bounding box proposals and (ii) advanced reasoning via prompt-based segmentation, which utilizes the full capabilities of modern VLMs. Extensive experiments on the ScanRefer and Nr3D benchmarks demonstrate that our approach achieves state-of-the-art performance among zero-shot methods. Code is available at https://github.com/col14m/z3d.