We introduce AMAGO, an in-context Reinforcement Learning (RL) agent that uses sequence models to tackle the challenges of generalization, long-term memory, and meta-learning. Recent works have shown that off-policy learning can make in-context RL with recurrent policies viable. Nonetheless, these approaches require extensive tuning and limit scalability by creating key bottlenecks in agents' memory capacity, planning horizon, and model size. AMAGO revisits and redesigns the off-policy in-context approach to successfully train long-sequence Transformers over entire rollouts in parallel with end-to-end RL. Our agent is uniquely scalable and applicable to a wide range of problems. We demonstrate its strong performance empirically in meta-RL and long-term memory domains. AMAGO's focus on sparse rewards and off-policy data also allows in-context learning to extend to goal-conditioned problems with challenging exploration. When combined with a novel hindsight relabeling scheme, AMAGO can solve a previously difficult category of open-world domains, where agents complete many possible instructions in procedurally generated environments. We evaluate our agent on three goal-conditioned domains and study how its individual improvements connect to create a generalist policy.