Offline reinforcement learning (RL) is challenged by the distributional shift problem. To address it, existing works mainly focus on designing sophisticated policy constraints between the learned policy and the behavior policy. However, because transitions are sampled uniformly, these constraints are applied equally to well-performing and inferior actions, which may negatively affect the learned policy. To alleviate this issue, we propose Offline Prioritized Experience Replay (OPER), featuring a class of priority functions designed to prioritize highly rewarding transitions so that they are visited more frequently during training. Through theoretical analysis, we show that this class of priority functions induces an improved behavior policy and that, when constrained to this improved policy, a policy-constrained offline RL algorithm is likely to yield a better solution. We develop two practical strategies to obtain priority weights: estimating advantages with a fitted value network (OPER-A), or using trajectory returns (OPER-R) for quick computation. OPER is a plug-and-play component for offline RL algorithms. As case studies, we evaluate OPER on five algorithms: BC, TD3+BC, Onestep RL, CQL, and IQL. Extensive experiments demonstrate that both OPER-A and OPER-R significantly improve the performance of all baseline methods. Code and priority weights are available at https://github.com/sail-sg/OPER.
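To make the return-based variant concrete, the following is a minimal sketch of how trajectory returns could be turned into per-transition priority weights and then used for non-uniform minibatch sampling. It is not the released implementation: the function names `return_based_priorities` and `sample_indices`, the min-max normalization, and the `base` offset are illustrative assumptions, since the abstract does not specify the exact form of the priority function.

```python
import numpy as np


def return_based_priorities(rewards, dones, base=0.1):
    """Give each transition a priority weight derived from its trajectory return.

    Assumed form (not taken from the source): min-max normalized,
    undiscounted trajectory return plus a small `base` offset, so that
    no transition ends up with zero sampling probability.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    dones = np.asarray(dones, dtype=bool)

    # Label every transition with the index of the trajectory it belongs to.
    # A trajectory ends at each transition where `dones` is True.
    traj_ids = np.concatenate(([0], np.cumsum(dones)[:-1]))
    n_traj = int(traj_ids[-1]) + 1

    # Sum the rewards of each trajectory to get its return.
    returns = np.bincount(traj_ids, weights=rewards, minlength=n_traj)

    # Min-max normalize returns to [0, 1]; fall back to equal weights
    # when all trajectories have the same return.
    span = returns.max() - returns.min()
    if span > 0:
        normalized = (returns - returns.min()) / span
    else:
        normalized = np.ones_like(returns)

    # Broadcast each trajectory's weight back to its transitions.
    return normalized[traj_ids] + base


def sample_indices(weights, batch_size, rng):
    """Draw transition indices with probability proportional to `weights`."""
    probs = weights / weights.sum()
    return rng.choice(len(weights), size=batch_size, p=probs)


# Toy dataset: one trajectory of three rewarding steps, one of two unrewarding steps.
weights = return_based_priorities(rewards=[1, 1, 1, 0, 0], dones=[0, 0, 1, 0, 1])
batch = sample_indices(weights, batch_size=4, rng=np.random.default_rng(0))
```

Under these assumptions, the sampled indices would simply replace uniformly drawn indices in the minibatch step of a baseline such as TD3+BC or IQL, leaving the rest of the algorithm unchanged, which is the sense in which such a component is plug-and-play.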