Offline reinforcement learning aims to learn a policy from datasets of previously collected environment-action interaction records, without access to the real environment. Recent work has shown that offline reinforcement learning can be formulated as a sequence modeling problem and solved via supervised learning with approaches such as Decision Transformer. While these sequence-based methods achieve competitive results relative to return-to-go methods, especially on tasks that require longer episodes or have sparse rewards, they do not use importance sampling to correct the policy bias that arises with off-policy data, mainly because the behavior policy is unavailable and the evaluation policies are deterministic. To this end, we propose DPE, an RL algorithm that combines offline sequence modeling and offline reinforcement learning through Double Policy Estimation (DPE) in a unified framework, with statistically proven variance-reduction properties. We validate our method on multiple OpenAI Gym tasks from the D4RL benchmarks. Our method improves the performance of the selected methods and outperforms SOTA baselines on several tasks, demonstrating the advantages of enabling double policy estimation for sequence-modeled reinforcement learning.
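To make the role of importance sampling concrete, the following is a minimal sketch of off-policy value estimation when the behavior policy is unknown and has to be estimated from the logged data. It is not the DPE algorithm described above. It is a toy one-step (bandit) setting with discrete actions, and the function names, the count-based behavior estimate, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def estimate_behavior_policy(actions, n_actions):
    """Maximum-likelihood estimate of an unknown behavior policy.

    Hypothetical helper: counts how often each discrete action appears in
    the logged data. Assumes every action occurs at least once.
    """
    counts = np.bincount(actions, minlength=n_actions).astype(float)
    return counts / counts.sum()


def importance_sampling_value(actions, rewards, target_probs, behavior_probs):
    """Ordinary importance-sampling estimate of the target policy's value.

    Each logged reward is reweighted by pi_target(a) / pi_behavior(a).
    """
    weights = target_probs[actions] / behavior_probs[actions]
    return float(np.mean(weights * rewards))


# Synthetic logged data (illustrative only).
rng = np.random.default_rng(0)
n_actions = 3
true_behavior = np.array([0.6, 0.3, 0.1])   # unknown to the learner
mean_reward = np.array([0.0, 1.0, 2.0])     # expected reward per action

actions = rng.choice(n_actions, size=5000, p=true_behavior)
rewards = mean_reward[actions] + rng.normal(scale=0.1, size=actions.shape)

# Target (evaluation) policy whose value we want to estimate.
target_probs = np.array([0.1, 0.3, 0.6])    # true value: 0.3*1 + 0.6*2 = 1.5

behavior_hat = estimate_behavior_policy(actions, n_actions)
is_estimate = importance_sampling_value(actions, rewards, target_probs, behavior_hat)
naive_estimate = float(rewards.mean())      # ignores the policy mismatch

print(f"naive average reward:      {naive_estimate:.3f}")
print(f"importance-sampled value:  {is_estimate:.3f}")
```

The naive average reflects the behavior policy's value (about 0.5 here), while the reweighted estimate moves toward the target policy's value of 1.5. The sketch covers only a single estimated policy in a one-step problem; it does not address the sequential setting or the deterministic evaluation policies mentioned in the abstract.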