Offline reinforcement learning (RL) aims to infer sequential decision policies using only offline datasets. This is a particularly difficult setup, especially when learning to achieve multiple different goals or outcomes under a given scenario with only sparse rewards. For offline learning of goal-conditioned policies via supervised learning, previous work has shown that an advantage weighted log-likelihood loss guarantees monotonic policy improvement. In this work we argue that, despite its benefits, this approach is still insufficient to fully address the distribution shift and multi-modality problems. The latter is particularly severe in long-horizon tasks where finding a unique and optimal policy that goes from a state to the desired goal is challenging as there may be multiple and potentially conflicting solutions. To tackle these challenges, we propose a complementary advantage-based weighting scheme that introduces an additional source of inductive bias: given a value-based partitioning of the state space, the contribution of actions expected to lead to target regions that are easier to reach, compared to the final goal, is further increased. Empirically, we demonstrate that the proposed approach, Dual-Advantage Weighted Offline Goal-conditioned RL (DAWOG), outperforms several competing offline algorithms in commonly used benchmarks. Analytically, we offer a guarantee that the learnt policy is never worse than the underlying behaviour policy.
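The weighting idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the equal-width value binning, the additive combination of the two advantages inside an exponential, the clipping constant, and all function and parameter names (`partition_index`, `dual_advantage_weight`, `weighted_nll`, `beta`, `beta_region`, `max_weight`) are assumptions made for illustration only. The advantage estimates are taken as given inputs rather than learned.

```python
import math


def partition_index(value, v_min, v_max, num_regions):
    """Assign a state to a region by binning its goal-conditioned value.

    Assumption: equal-width bins over [v_min, v_max]; higher index means
    the state is estimated to be closer to the goal.
    """
    if v_max <= v_min:
        return 0
    frac = (value - v_min) / (v_max - v_min)
    return max(0, min(int(frac * num_regions), num_regions - 1))


def dual_advantage_weight(adv_goal, adv_region, beta=1.0, beta_region=1.0,
                          max_weight=10.0):
    """Combine two advantage estimates into one clipped exponential weight.

    adv_goal:   advantage of the action with respect to the final goal.
    adv_region: advantage of the action with respect to reaching the
                easier-to-reach target region (hypothetical input).
    The exponent is clipped before exponentiation to avoid overflow.
    """
    exponent = beta * adv_goal + beta_region * adv_region
    return math.exp(min(exponent, math.log(max_weight)))


def weighted_nll(log_probs, weights):
    """Weighted negative log-likelihood, averaged over the batch."""
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(log_probs)


# Two transitions with the same goal advantage; the second one also
# moves towards the target region, so it receives a larger weight.
weights = [dual_advantage_weight(0.5, 0.0), dual_advantage_weight(0.5, 1.0)]
loss = weighted_nll([-1.0, -1.0], weights)
```

In this sketch, an action that is expected to reach the next region contributes more to the supervised loss than one that only has the same advantage with respect to the final goal, which is the additional inductive bias the abstract refers to.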