Reliable long-horizon task planning is crucial for deploying robots in real-world environments. However, directly employing Large Language Models (LLMs) as action sequence generators often results in low success rates because of their limited reasoning ability on long-horizon embodied tasks. In the STEP framework, we construct a subgoal tree through a pair of closed-loop models: a subgoal decomposition model and a leaf node termination model. The resulting hierarchical tree spans from coarse to fine resolutions. The subgoal decomposition model leverages a foundation LLM to break down complex goals into manageable subgoals, thereby expanding the subgoal tree. The leaf node termination model provides real-time feedback based on environmental states, determining when to stop expanding the tree and ensuring that each leaf node can be directly converted into a primitive action. Experiments on both the VirtualHome WAH-NL benchmark and real robots demonstrate that STEP completes long-horizon embodied tasks with success rates of up to 34% (WAH-NL) and 25% (real robot), outperforming SOTA methods.
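The interplay between the two models can be illustrated with a minimal sketch. It is not the STEP implementation: the names `SubgoalNode`, `build_subgoal_tree`, `toy_decompose`, and `toy_is_primitive` are hypothetical, a lookup table stands in for the foundation LLM, a simple rule stands in for the leaf node termination model, and the environment state is held fixed rather than refreshed during expansion. The `max_depth` guard is likewise an assumption added to keep the recursion bounded.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

State = Dict[str, str]  # toy environment state: object name -> status


@dataclass
class SubgoalNode:
    """One node of the subgoal tree; a node without children is a leaf."""
    goal: str
    children: List["SubgoalNode"] = field(default_factory=list)


def build_subgoal_tree(
    goal: str,
    state: State,
    decompose: Callable[[str, State], List[str]],
    is_primitive: Callable[[str, State], bool],
    max_depth: int = 4,
) -> SubgoalNode:
    """Expand `goal` coarse-to-fine until the termination check accepts each leaf."""
    node = SubgoalNode(goal)
    # Termination model: stop when the goal maps directly to a primitive action.
    if max_depth == 0 or is_primitive(goal, state):
        return node
    # Decomposition model: split the goal into finer subgoals and recurse.
    for subgoal in decompose(goal, state):
        child = build_subgoal_tree(subgoal, state, decompose, is_primitive, max_depth - 1)
        node.children.append(child)
    return node


def leaves(node: SubgoalNode) -> List[str]:
    """Collect leaf goals left to right; this is the resulting action sequence."""
    if not node.children:
        return [node.goal]
    result: List[str] = []
    for child in node.children:
        result.extend(leaves(child))
    return result


# Hypothetical stand-ins for the two models (assumptions, not from the paper).
TOY_RULES = {
    "put the apple in the fridge": ["fetch apple", "stow apple"],
    "fetch apple": ["walk apple", "grab apple"],
    "stow apple": ["walk fridge", "open fridge", "putin fridge", "close fridge"],
}
PRIMITIVE_VERBS = {"walk", "grab", "open", "putin", "close"}
TOY_STATE: State = {"apple": "on table", "fridge": "closed"}


def toy_decompose(goal: str, state: State) -> List[str]:
    """Lookup table in place of an LLM call; unknown goals yield no subgoals."""
    return TOY_RULES.get(goal, [])


def toy_is_primitive(goal: str, state: State) -> bool:
    """Accept 'verb object' goals whose verb is primitive and whose object is in the state."""
    parts = goal.split()
    return len(parts) == 2 and parts[0] in PRIMITIVE_VERBS and parts[1] in state


if __name__ == "__main__":
    tree = build_subgoal_tree(
        "put the apple in the fridge", TOY_STATE, toy_decompose, toy_is_primitive
    )
    print(leaves(tree))
```

Passing the two models in as callables keeps the tree-building logic independent of how decomposition and termination are realized, so the toy table and rule could be swapped for an LLM-backed decomposer and a state-aware termination check without changing `build_subgoal_tree`.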