Recent SOTA approaches for embodied learning via interaction directly employ large language models (LLMs) as agents to determine the next steps in an environment. Due to their world knowledge and reasoning capabilities, LLM agents achieve stronger performance than previous smaller agents based on reinforcement learning (RL); however, frequently calling LLMs is slow and expensive. Instead of directly employing LLMs as agents, can we use LLMs' reasoning capabilities to adaptively create training environments that help smaller RL agents learn the useful skills they are weak at? We propose EnvGen, a novel framework to address this question. We first prompt an LLM to generate training environments by giving it the task description and the simulator objectives that the agents should learn, and then asking it to generate a set of environment configurations (e.g., different terrains, items initially given to agents, etc.). Next, we train a small RL agent in a mixture of the original and LLM-generated environments. Then, by providing the agent's performance as feedback to the LLM, we enable the LLM to continuously adapt the generated environments to progressively improve the skills the agent is weak at. We demonstrate the usefulness of EnvGen with comprehensive experiments in the Crafter and Heist environments. We find that a small RL agent trained with EnvGen can outperform SOTA methods, including a GPT-4 agent, and learns long-horizon tasks significantly faster. We also show that dynamically adapting environments with an LLM outperforms curriculum learning approaches, and analyze how the environments are adapted over time to improve the RL agent's weaker skills. Additionally, EnvGen is substantially more efficient, as it uses only a small number of LLM calls (e.g., 4 in total), whereas LLM agents require thousands of calls. Lastly, we present detailed ablation studies of EnvGen's design choices.
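The generate–train–feedback cycle described above can be sketched as follows. This is a toy simulation, not the paper's implementation: the LLM call (`llm_generate_configs`), the RL trainer (`train_agent`), and the per-skill success-rate feedback are all hypothetical stand-ins. What it illustrates is the control flow: the LLM is called only once per cycle (e.g., 4 times in total), and each call is followed by a full RL training phase in the mixed environments.

```python
import random

def llm_generate_configs(task_desc, feedback, n_envs=4):
    # Hypothetical stand-in for an LLM call. EnvGen prompts the LLM with the
    # task description and the agent's performance feedback, and parses a set
    # of environment configurations (terrain, starting items, etc.) in return.
    weak = sorted(feedback, key=feedback.get)[:2] if feedback else []
    return [{"terrain": random.choice(["plains", "cave"]),
             "target_skills": weak}
            for _ in range(n_envs)]

def train_agent(agent_skills, configs):
    # Toy RL update: training in an environment that targets a skill nudges
    # that skill's success rate upward (capped at 1.0).
    for cfg in configs:
        for skill in cfg["target_skills"]:
            if skill in agent_skills:
                agent_skills[skill] = min(1.0, agent_skills[skill] + 0.2)
    return agent_skills

def evaluate(agent_skills):
    # Per-skill success rates become the feedback for the next LLM call.
    return dict(agent_skills)

# EnvGen-style loop: N_CYCLES LLM calls in total, not one call per env step.
N_CYCLES = 4
skills = {"collect_wood": 0.9, "make_sword": 0.1, "defeat_zombie": 0.2}
feedback = {}
for cycle in range(N_CYCLES):
    configs = llm_generate_configs("Crafter-like survival tasks", feedback)
    skills = train_agent(skills, configs)
    feedback = evaluate(skills)
```

Because the feedback highlights the weakest skills, later cycles concentrate training on them, which is the adaptive behavior the framework relies on.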