Pre-trained and frozen large language models (LLMs) can effectively map simple scene rearrangement instructions to programs over a robot's visuomotor functions through appropriate few-shot example prompting. However, fixed prompts fall short when the agent must parse open-domain natural language and adapt to a user's idiosyncratic procedures that are not known at prompt engineering time. In this paper, we introduce HELPER, an embodied agent equipped with an external memory of language-program pairs that parses free-form human-robot dialogue into action programs through retrieval-augmented LLM prompting: relevant memories are retrieved based on the current dialogue, instruction, correction, or VLM description, and are used as in-context prompt examples for LLM querying. The memory is expanded during deployment to include pairs of the user's language and action plans, which assists future inferences and personalizes them to the user's language and routines. HELPER sets a new state of the art on the TEACh benchmark in both Execution from Dialog History (EDH) and Trajectory from Dialogue (TfD), with a 1.7x improvement over the previous state of the art for TfD. Our models, code, and video results can be found on our project website: https://helper-agent-llm.github.io.
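To make the retrieval-augmented prompting loop concrete, the following is a minimal sketch rather than HELPER's actual implementation. It assumes a toy bag-of-words similarity in place of a learned text encoder, and the names `Memory`, `embed`, `build_prompt`, and the robot functions inside the example programs (`find`, `place`, `toggle_on`, and so on) are all hypothetical. The resulting prompt would be sent to an LLM, which is omitted here.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: stands in for a learned encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Memory:
    """External memory of (language, program) pairs."""
    pairs: list = field(default_factory=list)

    def add(self, language: str, program: str) -> None:
        # Expanding the memory at deployment time personalizes later retrievals.
        self.pairs.append((language, program))

    def retrieve(self, query: str, k: int = 2) -> list:
        # Rank stored pairs by similarity between the query and their language.
        q = embed(query)
        ranked = sorted(self.pairs, key=lambda p: cosine(q, embed(p[0])), reverse=True)
        return ranked[:k]


def build_prompt(memory: Memory, query: str, k: int = 2) -> str:
    """Assemble retrieved pairs as in-context examples, followed by the query."""
    blocks = [f"Input: {lang}\nProgram:\n{prog}" for lang, prog in memory.retrieve(query, k)]
    blocks.append(f"Input: {query}\nProgram:")
    return "\n\n".join(blocks)


# Hypothetical seed memory; the program strings are illustrative only.
memory = Memory()
memory.add("make a cup of coffee",
           "mug = find('Mug')\nplace(mug, 'CoffeeMachine')\ntoggle_on('CoffeeMachine')")
memory.add("slice the tomato",
           "knife = find('Knife')\npickup(knife)\nslice('Tomato')")
memory.add("put the book on the shelf",
           "book = find('Book')\npickup(book)\nplace(book, 'Shelf')")

prompt = build_prompt(memory, "please make me some coffee", k=1)
print(prompt)

# After a successful interaction, store the user's own phrasing and plan
# so that the same request can be retrieved directly next time.
memory.add("make my usual breakfast",
           "toast = make('Toast')\ncoffee = make('Coffee')\nserve(toast, coffee)")
```

In this sketch, the request for coffee retrieves the stored coffee example as its single in-context demonstration, and the final `add` call shows how a user-specific routine becomes retrievable for future requests.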