Pre-trained large language models (LLMs) capture procedural knowledge about the world. Recent work has leveraged LLMs' ability to generate abstract plans to simplify challenging control tasks, either by action scoring or by action modeling (fine-tuning). However, the transformer architecture imposes several constraints that make it difficult for an LLM to serve directly as the agent, e.g., limited input lengths, fine-tuning inefficiency, bias from pre-training, and incompatibility with non-text environments. To maintain compatibility with a low-level trainable actor, we propose instead to use the knowledge in LLMs to simplify the control problem rather than to solve it. We propose the Plan, Eliminate, and Track (PET) framework. The Plan module translates a task description into a list of high-level sub-tasks. The Eliminate module masks out objects and receptacles in the observation that are irrelevant to the current sub-task. Finally, the Track module determines whether the agent has accomplished each sub-task. On the AlfWorld instruction-following benchmark, the PET framework leads to a significant 15% improvement over the state of the art (SOTA) for generalization to human goal specifications.
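The division of labor among the three modules can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `PETState`, `plan`, `eliminate`, `track`, and `name_in_text` are hypothetical, and the LLM-backed components are replaced by toy string heuristics (splitting on " then " and substring matching) purely so that the example runs on its own. The low-level trainable actor is omitted; it would consume the filtered observation and the current sub-task.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PETState:
    """Hypothetical container for the sub-task list and a progress pointer."""
    sub_tasks: List[str]
    current: int = 0

    @property
    def done(self) -> bool:
        return self.current >= len(self.sub_tasks)


def plan(task: str) -> List[str]:
    """Plan: turn a task description into a list of high-level sub-tasks.

    Toy stand-in (assumption): split on ' then '. In the framework described
    above, this step draws on LLM knowledge instead.
    """
    return [part.strip() for part in task.split(" then ") if part.strip()]


def eliminate(
    objects: List[str],
    sub_task: str,
    is_relevant: Callable[[str, str], bool],
) -> List[str]:
    """Eliminate: keep only the objects/receptacles relevant to the sub-task.

    `is_relevant` is a placeholder for whatever relevance scorer is used.
    """
    return [obj for obj in objects if is_relevant(obj, sub_task)]


def track(state: PETState, sub_task_completed: bool) -> PETState:
    """Track: advance to the next sub-task once the current one is judged done."""
    if sub_task_completed and not state.done:
        state.current += 1
    return state


def name_in_text(obj: str, sub_task: str) -> bool:
    """Toy relevance check (assumption): the object name appears in the sub-task."""
    return obj in sub_task


if __name__ == "__main__":
    state = PETState(plan("take the apple then put the apple on the table"))
    observation = ["apple", "sofa", "table"]
    while not state.done:
        sub_task = state.sub_tasks[state.current]
        visible = eliminate(observation, sub_task, name_in_text)
        print(f"{sub_task!r}: actor sees {visible}")
        # A real system would run the actor here and ask the Track module
        # whether the sub-task succeeded; this demo assumes it always does.
        state = track(state, sub_task_completed=True)
```

Keeping the relevance check as a callback means the toy heuristic could be swapped for a learned or LLM-based scorer without touching the control loop, which mirrors the abstract's point that the LLM simplifies the problem for a separate actor.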