Pre-trained large language models (LLMs) capture procedural knowledge about the world. Recent work has leveraged LLMs' ability to generate abstract plans to simplify challenging control tasks, through either action scoring or action modeling (fine-tuning). However, the Transformer architecture carries several constraints that make it difficult for an LLM to serve directly as the agent, e.g., limited input lengths, fine-tuning inefficiency, bias from pre-training, and incompatibility with non-text environments. To maintain compatibility with a low-level trainable actor, we propose to use the knowledge in LLMs to simplify the control problem rather than to solve it. To this end, we introduce the Plan, Eliminate, and Track (PET) framework. The Plan module translates a task description into a list of high-level sub-tasks. The Eliminate module masks out objects and receptacles in the observation that are irrelevant to the current sub-task. Finally, the Track module determines whether the agent has accomplished each sub-task. On the AlfWorld instruction-following benchmark, the PET framework yields a significant 15% improvement over the state of the art (SOTA) in generalization to human goal specifications.
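The division of labor among the three modules can be illustrated with a minimal sketch. The abstract does not specify how each module queries the LLM, so every name below (`plan`, `eliminate`, `track`, the `toy_*` helpers, and the `threshold` parameter) is a hypothetical stand-in, and the LLM calls are replaced by simple rule-based functions so that the example runs without any model. It should be read as an illustration of the interfaces under these assumptions, not as the PET implementation.

```python
from typing import Callable, List

# All names here are hypothetical; the callables stand in for LLM queries.
Planner = Callable[[str], List[str]]      # task description -> sub-tasks
Relevance = Callable[[str, str], float]   # (object, sub-task) -> score in [0, 1]
DoneCheck = Callable[[str, str], bool]    # (sub-task, observation) -> finished?


def plan(task: str, planner: Planner) -> List[str]:
    """Plan: translate a task description into a list of high-level sub-tasks."""
    return planner(task)


def eliminate(objects: List[str], sub_task: str, relevance: Relevance,
              threshold: float = 0.5) -> List[str]:
    """Eliminate: keep only the objects judged relevant to the current sub-task."""
    return [obj for obj in objects if relevance(obj, sub_task) >= threshold]


def track(sub_tasks: List[str], progress: int, observation: str,
          is_done: DoneCheck) -> int:
    """Track: advance the sub-task index once the current sub-task is finished."""
    if progress < len(sub_tasks) and is_done(sub_tasks[progress], observation):
        return progress + 1
    return progress


# Toy rule-based stand-ins (assumptions) so the sketch runs without a model.
def toy_planner(task: str) -> List[str]:
    # A fixed plan; a real planner would prompt an LLM with the task.
    return ["take the apple",
            "heat the apple with the microwave",
            "put the apple in the fridge"]


def toy_relevance(obj: str, sub_task: str) -> float:
    # An object counts as relevant if the sub-task mentions it by name.
    return 1.0 if obj in sub_task else 0.0


DONE_PHRASES = {
    "take the apple": "pick up the apple",
    "heat the apple with the microwave": "heat the apple",
    "put the apple in the fridge": "put the apple in the fridge",
}


def toy_is_done(sub_task: str, observation: str) -> bool:
    # A sub-task is done when its completion phrase appears in the observation.
    return DONE_PHRASES[sub_task] in observation.lower()


if __name__ == "__main__":
    sub_tasks = plan("heat an apple and put it in the fridge", toy_planner)
    visible = ["apple", "fridge", "microwave", "sofa"]
    progress = 0
    for obs in ["You pick up the apple.",
                "You open the fridge.",
                "You heat the apple using the microwave."]:
        progress = track(sub_tasks, progress, obs, toy_is_done)
        if progress < len(sub_tasks):
            current = sub_tasks[progress]
            # The low-level actor would only see the filtered object list.
            print(current, "->", eliminate(visible, current, toy_relevance))
```

In this arrangement the low-level actor still chooses every action; the three functions only shrink what it has to consider at each step, which mirrors the stated goal of simplifying the control problem rather than solving it.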