Offline reinforcement learning (RL) struggles in environments with rich and noisy inputs, where the agent has access only to a fixed dataset and cannot interact with the environment. Prior work has addressed this by pre-training state representations and then training a policy on top of them. In this work, we introduce a simple yet effective approach for learning state representations. Our method, Behavior Prior Representation (BPR), learns state representations with an easy-to-integrate objective based on behavior cloning of the dataset: we first learn a state representation by mimicking actions from the dataset, and then train a policy on top of the fixed representation using any off-the-shelf offline RL algorithm. Theoretically, we prove that BPR provides performance guarantees when integrated into algorithms that either have policy improvement guarantees (conservative algorithms) or produce lower bounds on the policy values (pessimistic algorithms). Empirically, we show that BPR combined with existing state-of-the-art offline RL algorithms leads to significant improvements across several offline control benchmarks. The code is available at \url{https://github.com/bit1029public/offline_bpr}.
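The two-stage recipe described above (learn a representation by behavior cloning, then freeze it for policy training) can be illustrated with a minimal sketch. This is not the released implementation: the linear encoder, the squared-error cloning loss, plain SGD, and the names `encode`, `bc_loss`, `train_bpr_encoder`, and `to_representation_dataset` are all assumptions made for illustration, and the downstream offline RL algorithm is left out.

```python
import random

def encode(W, s):
    """Linear encoder phi(s) = W s (an assumed stand-in for a neural encoder)."""
    return [sum(w * x for w, x in zip(row, s)) for row in W]

def bc_loss(W, V, states, actions):
    """Mean squared error between cloned actions V phi(s) and dataset actions."""
    total = 0.0
    for s, a in zip(states, actions):
        z = encode(W, s)
        pred = [sum(v * zi for v, zi in zip(row, z)) for row in V]
        total += sum((p - t) ** 2 for p, t in zip(pred, a))
    return total / len(states)

def train_bpr_encoder(states, actions, rep_dim=2, lr=0.05, epochs=200, seed=0):
    """Stage 1: fit encoder W and action head V by behavior cloning (SGD)."""
    rng = random.Random(seed)
    s_dim, a_dim = len(states[0]), len(actions[0])
    W = [[rng.uniform(-0.5, 0.5) for _ in range(s_dim)] for _ in range(rep_dim)]
    V = [[rng.uniform(-0.5, 0.5) for _ in range(rep_dim)] for _ in range(a_dim)]
    for _ in range(epochs):
        for s, a in zip(states, actions):
            z = encode(W, s)
            pred = [sum(v * zi for v, zi in zip(row, z)) for row in V]
            err = [p - t for p, t in zip(pred, a)]
            # Gradient w.r.t. the representation, computed before V is updated.
            gz = [sum(V[i][j] * err[i] for i in range(a_dim)) for j in range(rep_dim)]
            for i in range(a_dim):
                for j in range(rep_dim):
                    V[i][j] -= lr * err[i] * z[j]
            for j in range(rep_dim):
                for k in range(s_dim):
                    W[j][k] -= lr * gz[j] * s[k]
    return W, V

def to_representation_dataset(W, transitions):
    """Stage 2 input: re-express (s, a, r, s') tuples with the frozen encoder W."""
    return [(encode(W, s), a, r, encode(W, s2)) for s, a, r, s2 in transitions]
```

In this sketch, the action head `V` is discarded after stage 1 and `W` is never updated again; the re-encoded transitions would then be passed to whichever offline RL learner is used, which sees only `phi(s)` rather than the raw, possibly noisy state.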