Offline policy learning aims to learn decision-making policies from existing datasets of trajectories, without collecting additional data. The primary motivation for using reinforcement learning (RL) instead of supervised learning techniques such as behavior cloning is to find a policy that achieves a higher average return than the trajectories constituting the dataset. However, we empirically find that when a dataset is dominated by suboptimal trajectories, state-of-the-art offline RL algorithms do not substantially improve over the average return of trajectories in the dataset. We argue this is because current offline RL algorithms assume that the learned policy should stay close to the trajectories in the dataset. If the dataset consists primarily of suboptimal trajectories, this assumption forces the policy to mimic the suboptimal actions. We overcome this issue by proposing a sampling strategy that constrains the policy only to "good data" rather than to all actions in the dataset (i.e., uniform sampling). We present a realization of the sampling strategy and an algorithm that can be used as a plug-and-play module in standard offline RL algorithms. Our evaluation demonstrates significant performance gains on 72 imbalanced datasets and the D4RL dataset, across three different offline RL algorithms. Code is available at https://github.com/Improbable-AI/dw-offline-rl.
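To make the contrast with uniform sampling concrete, the following is a minimal sketch of one simple way to bias sampling toward higher-return trajectories. It is an illustration only, not the realization proposed in this work: the softmax-over-returns weighting, the `temperature` parameter, and the helper names `return_weighted_probs` and `sample_trajectory_indices` are all assumptions introduced here, and the toy dataset is synthetic.

```python
import math
import random


def return_weighted_probs(returns, temperature=0.1):
    """Map per-trajectory returns to sampling probabilities.

    Hypothetical helper: a softmax over min-max normalized returns.
    A smaller temperature concentrates more mass on high-return trajectories.
    """
    lo, hi = min(returns), max(returns)
    if hi == lo:
        # All trajectories have the same return: fall back to uniform sampling.
        return [1.0 / len(returns)] * len(returns)
    logits = [(r - lo) / (hi - lo) / temperature for r in returns]
    top = max(logits)  # subtract the max logit for numerical stability
    weights = [math.exp(z - top) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


def sample_trajectory_indices(returns, batch_size, rng, temperature=0.1):
    """Draw trajectory indices with replacement, biased toward high returns."""
    probs = return_weighted_probs(returns, temperature)
    return rng.choices(range(len(returns)), weights=probs, k=batch_size)


# Synthetic imbalanced dataset: 95 low-return and 5 high-return trajectories.
returns = [10.0] * 95 + [100.0] * 5
rng = random.Random(0)
batch = sample_trajectory_indices(returns, batch_size=1000, rng=rng)

uniform_mean = sum(returns) / len(returns)
weighted_mean = sum(returns[i] for i in batch) / len(batch)
print(f"mean return under uniform sampling:  {uniform_mean:.1f}")
print(f"mean return under weighted sampling: {weighted_mean:.1f}")
```

Under these assumptions, an offline RL algorithm that draws its training batches with `sample_trajectory_indices` instead of uniformly would have its stay-close-to-data constraint anchored mostly on the few high-return trajectories, which is the intuition behind constraining the policy to "good data".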