The convergence of embodied agents and large language models (LLMs) has brought significant advances to embodied instruction following. In particular, the strong reasoning capabilities of LLMs make it possible for robots to perform long-horizon tasks without expensive annotated demonstrations. However, public benchmarks for testing the long-horizon reasoning capabilities of language-conditioned robots across varied scenarios are still missing. To fill this gap, this work focuses on tabletop manipulation and releases a simulation benchmark, \textit{LoHoRavens}, which covers several aspects of long-horizon reasoning: color, size, space, arithmetic, and reference. Furthermore, long-horizon manipulation with LLMs faces a key modality-bridging problem that prior work has studied little: how to incorporate observation feedback gathered during robot execution into the LLM's closed-loop planning. We investigate two methods of bridging the modality gap: caption generation, which gives the LLM explicit observation feedback, and a learnable interface, which gives it implicit observation feedback. These methods serve as the two baselines for our proposed benchmark. Experiments show that both methods struggle to solve some tasks, indicating that long-horizon manipulation tasks remain challenging for current popular models. We expect that the proposed public benchmark and baselines will help the community develop better models for long-horizon tabletop manipulation tasks.