To fully leverage the capabilities of mobile manipulation robots, it is imperative that they are able to autonomously execute long-horizon tasks in large unexplored environments. While large language models (LLMs) have shown emergent reasoning skills on arbitrary tasks, existing work primarily concentrates on explored environments, typically focusing on either navigation or manipulation tasks in isolation. In this work, we propose MoMa-LLM, a novel approach that grounds language models within structured representations derived from open-vocabulary scene graphs, dynamically updated as the environment is explored. We tightly interleave these representations with an object-centric action space. Given object detections, the resulting approach is zero-shot, open-vocabulary, and readily extendable to a spectrum of mobile manipulation and household robotic tasks. We demonstrate the effectiveness of MoMa-LLM in a novel semantic interactive search task in large realistic indoor environments. In extensive experiments in both simulation and the real world, we show substantially improved search efficiency compared to conventional baselines and state-of-the-art approaches, as well as its applicability to more abstract tasks. We make the code publicly available at http://moma-llm.cs.uni-freiburg.de.