Task planning is an important component of traditional robotics systems, enabling robots to compose fine-grained skills to perform more complex tasks. Recent work on translating natural language into executable actions for task completion by simulated embodied agents has focused on directly predicting low-level action sequences that are expected to be directly executable by a physical robot. In this work, we instead focus on predicting a higher-level plan representation for one such embodied task completion dataset, TEACh, under the assumption that techniques for high-level plan prediction from natural language are more likely to transfer to physical robot systems. We demonstrate that better plans can be predicted using multimodal context, and that the plan prediction and plan execution modules are likely dependent on each other, so fully decoupling them may not be ideal. Further, we benchmark the execution of oracle plans to quantify the scope for improvement in plan prediction models.