Offline reinforcement learning (RL) is the task of learning from a static logged dataset without further interaction with the environment. The distribution shift between the learned policy and the behavior policy requires the value function to stay conservative so that out-of-distribution (OOD) actions are not severely overestimated. However, existing approaches, which penalize unseen actions or regularize toward the behavior policy, are too pessimistic: they suppress the generalization of the value function and hinder performance improvement. This paper explores conservatism that is mild but sufficient for offline learning and does not harm generalization. We propose Mildly Conservative Q-learning (MCQ), in which OOD actions are actively trained by assigning them proper pseudo Q values. We show theoretically that MCQ induces a policy that behaves at least as well as the behavior policy and that no erroneous overestimation occurs for OOD actions. Experimental results on the D4RL benchmarks demonstrate that MCQ achieves remarkable performance compared with prior work. Furthermore, MCQ shows superior generalization ability when transferring from offline to online, where it significantly outperforms baselines. Our code is publicly available at https://github.com/dmksjfl/MCQ.
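To make the idea of training OOD actions toward pseudo Q values concrete, the following is a minimal tabular sketch, not the released implementation. The abstract does not say how the pseudo Q values are chosen; the sketch assumes, for illustration only, that the pseudo target for an OOD action is the largest Q value among in-dataset actions at the same state. The function names, the mixing weight `lam`, and the step size `lr` are all hypothetical.

```python
import numpy as np


def pseudo_targets(q, in_support):
    """Assumed pseudo target: per-state max Q over in-dataset actions.

    q:          (S, A) array of current Q values.
    in_support: (S, A) boolean mask, True where (state, action) is in the dataset.
    """
    # Every state needs at least one logged action, otherwise the max is undefined.
    assert in_support.any(axis=1).all(), "each state needs an in-dataset action"
    masked = np.where(in_support, q, -np.inf)
    return masked.max(axis=1)  # shape (S,)


def mcq_style_step(q, in_support, bellman_target, lam=0.7, lr=0.5):
    """One gradient step on a weighted sum of two squared-error losses.

    In-dataset entries regress toward `bellman_target` (weight lam);
    OOD entries regress toward the pseudo target (weight 1 - lam).
    Targets are treated as constants when taking the gradient.
    """
    target_ood = pseudo_targets(q, in_support)[:, None]  # broadcast over actions
    grad_in = np.where(in_support, q - bellman_target, 0.0)
    grad_ood = np.where(in_support, 0.0, q - target_ood)
    return q - lr * (lam * grad_in + (1.0 - lam) * grad_ood)


# Toy example: 2 states, 2 actions; action 1 in state 0 is OOD and overestimated.
q = np.array([[1.0, 5.0], [2.0, 0.0]])
in_support = np.array([[True, False], [True, True]])
bellman_target = np.array([[1.5, 0.0], [2.0, 1.0]])  # OOD entry is ignored
q_new = mcq_style_step(q, in_support, bellman_target, lam=0.5, lr=1.0)
print(q_new)
```

In this toy run the overestimated OOD entry is pulled toward the best in-dataset value at its state rather than being pushed arbitrarily low, which is one way to read "mild" conservatism under the stated assumption.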