Training generalist agents is difficult across several axes, requiring us to deal with high-dimensional inputs (space), long horizons (time), and multiple, novel tasks. Recent architectural advances have improved scaling along one or two of these dimensions, but remain computationally prohibitive. In this paper, we propose to address all three axes by leveraging Language to Control Diffusion models (LCD) as a hierarchical planner conditioned on language. We effectively and efficiently scale diffusion models for planning along the temporal, state, and task dimensions to tackle long-horizon control problems conditioned on natural language instructions. We compare LCD with other state-of-the-art (SOTA) models on the CALVIN language robotics benchmark and find that LCD outperforms other SOTA methods in multi-task success rates, achieving a single-task success rate (SR) of 88.7% against the previous best of 82.6%, while dramatically improving computational efficiency. We show that LCD can successfully leverage the unique strength of diffusion models in producing coherent long-range plans while addressing their weakness at generating low-level details and control. We release our code and models at https://github.com/ezhang7423/language-control-diffusion.