Beyond Bare Queries: Open-Vocabulary Object Retrieval with 3D Scene Graph

Locating objects referred to in natural language poses a significant challenge for autonomous agents. Existing CLIP-based open-vocabulary methods successfully perform 3D object retrieval with simple (bare) queries but cannot cope with ambiguous descriptions that demand an understanding of object relations. To tackle this problem, we propose a modular approach called BBQ (Beyond Bare Queries), which constructs a 3D spatial scene graph representation with metric edges and uses a large language model as a human-to-agent interface through our deductive scene reasoning algorithm. BBQ employs robust DINO-powered associations to form 3D objects, an advanced raycasting algorithm to project them to 2D, and a vision-language model to describe them as graph nodes. On the Replica and ScanNet datasets, we show that the proposed method accurately constructs 3D object-centric maps, and that their quality is among the best for open-vocabulary 3D semantic segmentation compared with other zero-shot methods. We also show that leveraging spatial relations is especially effective for scenes containing multiple entities of the same semantic class. On the Sr3D and Nr3D benchmarks, our deductive approach significantly improves on other state-of-the-art methods at retrieving objects from complex queries. Thanks to our design choices, BBQ runs approximately three times faster than the closest comparable method. This performance makes our approach suitable for applied intelligent robotics projects. We make the code publicly available at linukc.github.io/bbq/.
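
To make the idea of a scene graph with metric edges more concrete, the following is a minimal sketch and not the BBQ implementation. It assumes each object is reduced to a text description and a 3D center, stores pairwise Euclidean distances as edges, and resolves a query such as "the chair near the table" by picking the candidate closest to an anchor object. The names `SceneObject`, `metric_edges`, and `retrieve_nearest` are hypothetical, and simple substring matching stands in for the language-model reasoning described above.

```python
from dataclasses import dataclass
from itertools import combinations
import math


@dataclass(frozen=True)
class SceneObject:
    obj_id: int
    description: str  # assumption: stands in for a vision-language model caption
    center: tuple     # assumption: (x, y, z) object center in metres


def metric_edges(objects):
    """Return {(id_a, id_b): Euclidean distance} for every unordered pair."""
    return {
        (a.obj_id, b.obj_id): math.dist(a.center, b.center)
        for a, b in combinations(objects, 2)
    }


def retrieve_nearest(objects, target, anchor):
    """Among objects matching `target`, pick the one closest to any `anchor`."""
    edges = metric_edges(objects)
    targets = [o for o in objects if target in o.description]
    anchors = [o for o in objects if anchor in o.description]
    if not targets or not anchors:
        return None

    def nearest_anchor_distance(t):
        # Look the edge up in either orientation and skip the object itself.
        dists = [
            edges.get((t.obj_id, a.obj_id), edges.get((a.obj_id, t.obj_id)))
            for a in anchors
            if a.obj_id != t.obj_id
        ]
        return min(dists, default=math.inf)

    return min(targets, key=nearest_anchor_distance)


# Toy scene with two objects of the same semantic class.
scene = [
    SceneObject(0, "wooden chair", (0.0, 0.0, 0.0)),
    SceneObject(1, "office chair", (3.0, 0.0, 0.0)),
    SceneObject(2, "dining table", (3.5, 0.0, 0.0)),
]
match = retrieve_nearest(scene, target="chair", anchor="table")
print(match.obj_id)
```

A bare query for "chair" is ambiguous in this toy scene, while the distance edges let the relational query select a single object. A real system would replace the substring matching with learned descriptions and language-model reasoning.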
