This letter introduces ERRA, an embodied learning architecture that enables robots to jointly acquire three fundamental capabilities (reasoning, planning, and interaction) for solving long-horizon language-conditioned manipulation tasks. ERRA is based on tightly coupled probabilistic inference at two levels of granularity. Coarse-resolution inference is formulated as sequence generation with a large language model, which infers action language from the natural language instruction and the environment state. The robot then switches to fine-resolution inference to perform the concrete action corresponding to the action language. Fine-resolution inference is formulated as a Markov decision process, which takes the action language and environmental sensing as observations and outputs the action. The outcome of executing the action in the environment provides feedback for subsequent coarse-resolution reasoning. This coarse-to-fine inference allows the robot to decompose and complete long-horizon tasks interactively. In extensive experiments, we show that ERRA can complete various long-horizon manipulation tasks specified by abstract language instructions. We also demonstrate successful generalization to novel but similar natural language instructions.
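The coarse-to-fine loop described above can be illustrated with a minimal sketch. It is not the ERRA implementation: the language model is replaced by a rule-based placeholder, the learned policy by a hand-written one, and the environment by a toy dictionary. All names here (`coarse_infer`, `fine_policy`, `env_step`, `run_episode`, the "tidy the table" instruction) are hypothetical and chosen only to show how the two levels exchange action language and feedback.

```python
from typing import Dict, List, Optional, Tuple

# Toy world state (an assumption): object name -> "table", "gripper", or "bin".
State = Dict[str, str]


def coarse_infer(instruction: str, state: State) -> Optional[str]:
    """Placeholder for the language model: (instruction, state) -> action language."""
    if instruction != "tidy the table":
        raise ValueError("toy planner only knows 'tidy the table'")
    for name in sorted(state):
        if state[name] != "bin":
            return f"put {name} in bin"
    return None  # every object is in the bin, so the task is complete


def fine_policy(action_language: str, state: State) -> Optional[str]:
    """Placeholder for the learned policy: (action language, observation) -> action."""
    name = action_language.split()[1]
    if state[name] == "table":
        return f"grasp {name}"
    if state[name] == "gripper":
        return f"release {name}"
    return None  # sub-goal reached, hand control back to the coarse level


def env_step(state: State, action: str) -> State:
    """Toy deterministic transition: grasp moves to gripper, release moves to bin."""
    verb, name = action.split()
    new_state = dict(state)
    new_state[name] = "gripper" if verb == "grasp" else "bin"
    return new_state


def run_episode(
    instruction: str, state: State, max_subgoals: int = 10, max_steps: int = 5
) -> Tuple[State, List[Tuple[str, str]]]:
    """Alternate coarse and fine inference until the coarse level has nothing left to do."""
    trace: List[Tuple[str, str]] = []
    for _ in range(max_subgoals):
        # Coarse level: re-plan from the latest state, which carries the feedback.
        action_language = coarse_infer(instruction, state)
        if action_language is None:
            break
        # Fine level: execute low-level actions until the sub-goal is reached.
        for _ in range(max_steps):
            action = fine_policy(action_language, state)
            if action is None:
                break
            state = env_step(state, action)
            trace.append((action_language, action))
    return state, trace


if __name__ == "__main__":
    final_state, trace = run_episode("tidy the table", {"cup": "table", "spoon": "table"})
    for action_language, action in trace:
        print(f"{action_language!r} -> {action!r}")
    print(final_state)
```

The point of the sketch is the control flow: the coarse level is queried again after each sub-goal with the updated state, so the result of execution shapes the next piece of action language rather than the whole plan being fixed in advance.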