Offline reinforcement learning (RL) allows agents to learn effective, return-maximizing policies from a static dataset. Three major paradigms for offline RL are Q-Learning, Imitation Learning, and Sequence Modeling. A key open question is which paradigm is preferred under which conditions. We study this question empirically by evaluating one representative algorithm per paradigm, namely Conservative Q-Learning (CQL), Behavior Cloning (BC), and Decision Transformer (DT), across the commonly used D4RL and Robomimic benchmarks. We design targeted experiments to understand how each behaves with respect to data suboptimality and task complexity. Our key findings are: (1) Sequence Modeling requires more data than Q-Learning to learn competitive policies but is more robust; (2) Sequence Modeling is a substantially better choice than both Q-Learning and Imitation Learning in sparse-reward and low-quality data settings; and (3) Sequence Modeling and Imitation Learning are preferable as the task horizon increases or when the data is obtained from human demonstrators. Given the overall strength of Sequence Modeling, we also investigate architectural choices and scaling trends for DT on Atari and D4RL and make design recommendations. We find that scaling the amount of data for DT by 5x gives a 2.5x average score improvement on Atari.