Language Models and Vision Language Models have recently demonstrated unprecedented capabilities in understanding human intentions, reasoning, scene understanding, and planning-like behaviour in text form, among many others. In this work, we investigate how to embed and leverage these abilities in Reinforcement Learning (RL) agents. We design a framework that uses language as the core reasoning tool and explore how this enables an agent to tackle a series of fundamental RL challenges, such as efficient exploration, reusing experience data, scheduling skills, and learning from observations, which traditionally require separate, vertically designed algorithms. We test our method on a sparse-reward simulated robotic manipulation environment in which a robot needs to stack a set of objects. We demonstrate substantial performance improvements over baselines in exploration efficiency and in the ability to reuse data from offline datasets, and illustrate how to reuse learned skills to solve novel tasks or imitate videos of human experts.