If robots are to work effectively alongside people, they must be able to interpret natural language references to objects in their 3D environment. Understanding 3D referring expressions is challenging: it requires both parsing the 3D structure of the scene and correctly grounding free-form language in the presence of distraction and clutter. We introduce Transcrib3D, an approach that brings together 3D detection methods and the emergent reasoning capabilities of large language models (LLMs). Transcrib3D uses text as the unifying medium, which sidesteps the need to learn shared representations connecting multi-modal inputs, a step that would require massive amounts of annotated 3D data. Transcrib3D achieves state-of-the-art results on 3D reference resolution benchmarks, substantially outperforming previous multi-modal baselines. To improve upon zero-shot performance and facilitate local deployment on edge computers and robots, we propose self-correction for fine-tuning, which trains smaller models to reach performance close to that of large models. We show that our method enables a real robot to perform pick-and-place tasks given queries that contain challenging referring expressions. The project site is at https://ripl.github.io/Transcrib3D.
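To make the idea of text as the unifying medium concrete, the following is a minimal sketch of how 3D detector output might be transcribed into plain text and combined with a referring expression into an LLM prompt. It is an illustration under stated assumptions, not the Transcrib3D implementation: the `DetectedObject` schema, the functions `transcribe_scene` and `build_prompt`, and the prompt wording are all hypothetical, and no LLM is called.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DetectedObject:
    """One detection from a 3D detector (hypothetical schema, assumed for illustration)."""
    obj_id: int
    label: str
    center: Tuple[float, float, float]  # metres, in the scene frame
    size: Tuple[float, float, float]    # bounding-box extents in metres


def transcribe_scene(objects: List[DetectedObject]) -> str:
    """Render each detection as one line of plain text."""
    lines = []
    for o in objects:
        cx, cy, cz = o.center
        sx, sy, sz = o.size
        lines.append(
            f"obj {o.obj_id}: {o.label}; "
            f"center=({cx:.2f}, {cy:.2f}, {cz:.2f}); "
            f"size=({sx:.2f}, {sy:.2f}, {sz:.2f})"
        )
    return "\n".join(lines)


def build_prompt(objects: List[DetectedObject], query: str) -> str:
    """Combine the scene transcript and the query into a single text prompt.

    The wording here is an assumption; the actual prompts may differ.
    """
    return (
        "Objects in the scene:\n"
        f"{transcribe_scene(objects)}\n\n"
        f"Referring expression: {query}\n"
        "Answer with the id of the referred object."
    )


scene = [
    DetectedObject(0, "chair", (1.0, 2.0, 0.45), (0.5, 0.5, 0.9)),
    DetectedObject(1, "table", (1.6, 2.1, 0.4), (1.2, 0.8, 0.8)),
]
prompt = build_prompt(scene, "the chair next to the table")
```

Because the scene is reduced to text, the resulting `prompt` string can be passed to any text-only LLM, with no learned 3D-language embedding involved in this step.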