Offline reinforcement learning (RL) is challenged by the distributional shift problem. To address this problem, existing works mainly focus on designing sophisticated policy constraints between the learned policy and the behavior policy. However, because training data are sampled uniformly, these constraints are applied equally to well-performing and inferior actions, which might negatively affect the learned policy. To alleviate this issue, we propose Offline Prioritized Experience Replay (OPER), featuring a class of priority functions designed to prioritize highly-rewarding transitions, making them more frequently visited during training. Through theoretical analysis, we show that this class of priority functions induces an improved behavior policy, and when constrained to this improved policy, a policy-constrained offline RL algorithm is likely to yield a better solution. We develop two practical strategies to obtain priority weights: estimating advantages based on a fitted value network (OPER-A), or utilizing trajectory returns (OPER-R) for quick computation. OPER is a plug-and-play component for offline RL algorithms. As case studies, we evaluate OPER on five different algorithms: BC, TD3+BC, Onestep RL, CQL, and IQL. Extensive experiments demonstrate that both OPER-A and OPER-R significantly improve the performance of all baseline methods. Code and priority weights are available at https://github.com/sail-sg/OPER.
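
To make the return-based idea concrete, the following is a minimal sketch of how trajectory returns could be turned into sampling priorities. It is not taken from the OPER repository: the function names, the min-max normalization, and the `base` offset are all assumptions chosen for illustration, and the priority function actually used by OPER-R may differ.

```python
import numpy as np


def return_based_priorities(rewards, terminals, base=0.1):
    """Give every transition a weight derived from its trajectory's return.

    rewards:   1-D array of per-transition rewards, trajectories concatenated.
    terminals: 1-D boolean array, True at the last transition of a trajectory.
    base:      hypothetical floor so the lowest-return trajectory is still sampled.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    terminals = np.asarray(terminals, dtype=bool)
    # Trajectory id per transition: the id increases after each terminal flag.
    traj_ids = np.concatenate(([0], np.cumsum(terminals)[:-1]))
    # Undiscounted return of each trajectory.
    returns = np.bincount(traj_ids, weights=rewards)
    span = returns.max() - returns.min()
    if span > 0:
        normalized = (returns - returns.min()) / span
    else:
        normalized = np.zeros_like(returns)
    # Broadcast the trajectory-level weight back to its transitions.
    return (normalized + base)[traj_ids]


def sample_indices(priorities, batch_size, rng):
    """Draw transition indices with probability proportional to priority."""
    probs = priorities / priorities.sum()
    return rng.choice(len(priorities), size=batch_size, p=probs)


# Toy dataset: a 3-step trajectory with return 3 and a 2-step one with return 0.
priorities = return_based_priorities(
    rewards=[1.0, 1.0, 1.0, 0.0, 0.0],
    terminals=[False, False, True, False, True],
)
rng = np.random.default_rng(0)
batch = sample_indices(priorities, batch_size=1000, rng=rng)
```

A batch drawn this way contains transitions from the higher-return trajectory far more often than uniform sampling would, so a policy constraint fitted on such batches is pulled toward the better-performing actions. Under this sketch, an offline RL algorithm would only need to replace its uniform index sampler with `sample_indices`.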