Fully leveraging the capabilities of mobile manipulation robots requires that they autonomously execute long-horizon tasks in large, unexplored environments. While large language models (LLMs) have shown emergent reasoning skills on arbitrary tasks, existing work primarily concentrates on explored environments and typically addresses either navigation or manipulation in isolation. In this work, we propose MoMa-LLM, a novel approach that grounds language models within structured representations derived from open-vocabulary scene graphs, which are dynamically updated as the environment is explored. We tightly interleave these representations with an object-centric action space. The resulting approach is zero-shot, open-vocabulary, and readily extensible to a spectrum of mobile manipulation and household robotic tasks. We demonstrate the effectiveness of MoMa-LLM in a novel semantic interactive search task in large realistic indoor environments. In extensive experiments in both simulation and the real world, we show substantially improved search efficiency compared to conventional baselines and state-of-the-art approaches, and we demonstrate the applicability of MoMa-LLM to more abstract tasks. We make the code publicly available at http://moma-llm.cs.uni-freiburg.de.
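As a rough illustration of the grounding idea described in the abstract, the following Python sketch serializes a partially explored scene into text for a language model and enumerates object-centric actions from the same structure. It is a minimal sketch under stated assumptions: the `Room` class, the functions `scene_to_prompt` and `available_actions`, the prompt layout, and the action names are all hypothetical and are not taken from the MoMa-LLM code base.

```python
from dataclasses import dataclass, field


@dataclass
class Room:
    """Hypothetical room node of a scene graph (not the MoMa-LLM data model)."""
    name: str
    objects: list[str] = field(default_factory=list)
    unexplored_frontiers: int = 0  # assumed count of open frontiers in this room


def scene_to_prompt(rooms: list[Room], task: str) -> str:
    """Render the scene explored so far as plain text for a language model."""
    lines = [f"Task: {task}", "Scene so far:"]
    for room in rooms:
        contents = ", ".join(room.objects) if room.objects else "nothing seen yet"
        lines.append(
            f"- {room.name}: {contents} "
            f"(unexplored frontiers: {room.unexplored_frontiers})"
        )
    return "\n".join(lines)


def available_actions(rooms: list[Room]) -> list[str]:
    """List illustrative object-centric actions derived from the same graph."""
    actions = []
    for room in rooms:
        actions.append(f"navigate({room.name})")
        for obj in room.objects:
            actions.append(f"open({obj})")
        if room.unexplored_frontiers > 0:
            # Exploration is only offered where something is left to discover.
            actions.append(f"explore({room.name})")
    return actions


rooms = [Room("kitchen", ["fridge"], 1), Room("hallway")]
prompt = scene_to_prompt(rooms, "find milk")
actions = available_actions(rooms)
```

Because both the prompt and the action list are regenerated from the current graph, newly observed rooms and objects would appear in the next query without any retraining, which is the sense in which such a representation can stay zero-shot and open-vocabulary.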