Real-world robots localize objects from natural-language instructions while the scenes around them keep changing. Yet most existing 3D visual grounding (3DVG) methods still assume a reconstructed, up-to-date point cloud, an assumption that forces costly re-scans and hinders deployment. We argue that 3DVG should be formulated as an active, memory-driven problem, and we introduce ChangingGrounding, the first benchmark that explicitly measures how well an agent can exploit past observations, explore only where needed, and still deliver precise 3D boxes in changing scenes. To set a strong reference point, we also propose Mem-ChangingGrounder, a zero-shot method for this task that marries cross-modal retrieval with lightweight multi-view fusion: it identifies the object type implied by the query, retrieves relevant memories to guide its actions, explores the target efficiently in the scene, falls back when previous operations prove invalid, performs multi-view scanning of the target, and projects the fused evidence from these scans to obtain accurate object bounding boxes. We evaluate different baselines on ChangingGrounding, and Mem-ChangingGrounder achieves the highest localization accuracy while greatly reducing exploration cost. We hope this benchmark and method catalyze a shift toward practical, memory-centric 3DVG research for real-world applications. Project page: https://hm123450.github.io/CGB/ .