Hierarchical control for robotics has long been plagued by the need for a well-defined interface layer between high-level task planners and low-level policies. With the advent of LLMs, language has emerged as a prospective interface layer. However, this choice has several limitations. Not all tasks can be decomposed into steps that are easily expressed in natural language (e.g., performing a dance routine). Further, using language as the interface makes end-to-end finetuning on embodied data challenging due to domain shift and catastrophic forgetting. We introduce Learnable Latent Codes as Bridges (LCB), an alternative architecture that overcomes these limitations. LCB uses a learnable latent code as a bridge between LLMs and low-level policies. This enables LLMs to flexibly communicate goals in the task plan without being entirely constrained by the limitations of language. Additionally, it enables end-to-end finetuning without destroying the embedding space of word tokens learned during pre-training. Through experiments on Language Table and Calvin, two common language-based benchmarks for embodied agents, we find that LCB outperforms baselines (including those with GPT-4V) that use pure language as the interface layer on tasks that require reasoning and multi-step behaviors.
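The idea of a learnable latent code that is trained end to end while the pre-trained word-token embeddings stay untouched can be illustrated with a toy sketch. Everything below is a hypothetical stand-in rather than the LCB implementation: `word_embeddings` plays the role of the frozen pre-trained embedding table, `latent_code` is the learnable bridge vector, `policy_w` is a one-layer linear "low-level policy", and the scalar target action and learning rate are assumed values chosen only for illustration.

```python
# Toy sketch (assumption: NOT the authors' code) of a learnable latent code
# bridging a frozen language embedding space and a trainable low-level policy.

# Hypothetical frozen word-token embeddings; never updated during training.
word_embeddings = {
    "push": [0.1, 0.2, 0.3, 0.4],
    "block": [0.4, 0.3, 0.2, 0.1],
}

# Learnable latent code: the only signal passed from the "LLM side" to the policy.
latent_code = [0.0, 0.0, 0.0, 0.0]

# Hypothetical linear policy weights: action = policy_w . latent_code
policy_w = [0.5, -0.3, 0.2, 0.1]


def policy(latent):
    """Map the latent code to a scalar action."""
    return sum(w * z for w, z in zip(policy_w, latent))


def train_step(target_action, lr=0.1):
    """One gradient step on squared error.

    Gradients flow into the latent code and the policy weights only;
    the word-token embeddings are not part of the update.
    """
    err = policy(latent_code) - target_action
    grad_z = [2 * err * w for w in policy_w]      # d(loss)/d(latent_code)
    grad_w = [2 * err * z for z in latent_code]   # d(loss)/d(policy_w)
    latent_code[:] = [z - lr * g for z, g in zip(latent_code, grad_z)]
    policy_w[:] = [w - lr * g for w, g in zip(policy_w, grad_w)]
    return err * err


# Snapshot the frozen embeddings, then finetune toward an assumed target action.
frozen_copy = {tok: list(vec) for tok, vec in word_embeddings.items()}
losses = [train_step(target_action=1.0) for _ in range(200)]
```

In this sketch the loss is driven down purely by adjusting `latent_code` and `policy_w`, so `word_embeddings` is identical before and after training. A real system would presumably produce the latent code from an LLM's hidden state and train a far richer policy; those details are not specified here.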