3D visual grounding is crucial for robots, requiring the integration of natural language and 3D scene understanding. Traditional methods that depend on supervised learning with 3D point clouds are limited by scarce datasets. Recently, zero-shot methods leveraging LLMs have been proposed to address the data issue. While effective, these methods use only object-centric information, limiting their ability to handle complex queries. In this work, we present VLM-Grounder, a novel framework using vision-language models (VLMs) for zero-shot 3D visual grounding based solely on 2D images. VLM-Grounder dynamically stitches image sequences, employs a grounding-and-feedback scheme to find the target object, and uses multi-view ensemble projection to accurately estimate 3D bounding boxes. Experiments on the ScanRefer and Nr3D datasets show that VLM-Grounder outperforms previous zero-shot methods, achieving 51.6% Acc@0.25 on ScanRefer and 48.0% Acc on Nr3D, without relying on 3D geometry or object priors. Code is available at https://github.com/OpenRobotLab/VLM-Grounder .