Reliable long-horizon task planning is crucial for deploying robots in real-world environments. However, directly employing Large Language Models (LLMs) as action sequence generators often results in low success rates due to their limited reasoning ability for long-horizon embodied tasks. In the STEP framework, we construct a subgoal tree through a pair of closed-loop models: a subgoal decomposition model and a leaf node termination model. The resulting tree is hierarchical, with subgoals refined from coarse to fine resolution. The subgoal decomposition model leverages a foundation LLM to break down complex goals into manageable subgoals, thereby expanding the subgoal tree. The leaf node termination model provides real-time feedback based on environmental states, determining when to stop expanding the tree and ensuring that each leaf node can be directly converted into a primitive action. Experiments on the VirtualHome WAH-NL benchmark and on real robots demonstrate that STEP achieves long-horizon embodied task completion with success rates of up to 34% (WAH-NL) and 25% (real robot), outperforming SOTA methods.
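The interplay between the two models described above can be sketched as a recursive tree expansion. This is a minimal illustration and not the STEP implementation: the names `SubgoalNode`, `expand`, and `leaves`, the `max_depth` guard, and the toy stand-ins `toy_decompose` and `toy_is_leaf` are all assumptions introduced here. In the framework itself, decomposition is performed by a foundation LLM and termination by a model that reads the environment state; here a lookup table and a verb check take their place.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

State = Dict[str, object]  # hypothetical environment-state representation


@dataclass
class SubgoalNode:
    goal: str
    children: List["SubgoalNode"] = field(default_factory=list)


def expand(
    goal: str,
    state: State,
    decompose: Callable[[str, State], List[str]],
    is_leaf: Callable[[str, State], bool],
    depth: int = 0,
    max_depth: int = 5,
) -> SubgoalNode:
    """Grow a subgoal tree from coarse to fine.

    `is_leaf` stands in for the leaf node termination model and
    `decompose` for the subgoal decomposition model. `max_depth` is a
    safety guard added for this sketch only.
    """
    node = SubgoalNode(goal)
    if is_leaf(goal, state) or depth >= max_depth:
        return node
    for sub in decompose(goal, state):
        node.children.append(
            expand(sub, state, decompose, is_leaf, depth + 1, max_depth)
        )
    return node


def leaves(node: SubgoalNode) -> List[str]:
    """Collect leaf goals left to right; these form the action sequence."""
    if not node.children:
        return [node.goal]
    plan: List[str] = []
    for child in node.children:
        plan.extend(leaves(child))
    return plan


# Toy stand-ins (assumptions): a rule table instead of an LLM, and a verb
# check against the actions the state reports as currently available.
TOY_RULES = {
    "serve milk": ["get milk", "deliver milk to table"],
    "get milk": ["walk to fridge", "open fridge", "grab milk"],
    "deliver milk to table": ["walk to table", "put milk"],
}


def toy_decompose(goal: str, state: State) -> List[str]:
    return TOY_RULES.get(goal, [])


def toy_is_leaf(goal: str, state: State) -> bool:
    return goal.split()[0] in state["available_actions"]
```

Calling `expand("serve milk", {"available_actions": {"walk", "open", "grab", "put"}}, toy_decompose, toy_is_leaf)` and passing the result to `leaves` yields the primitive steps in execution order. Keeping the two models as separate callables mirrors the separation in the abstract: decomposition decides how to refine a goal, while termination decides, given the current state, whether refinement is still needed.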