In this work, we address the problem of learning optimal behavior from sub-optimal datasets in goal-conditioned offline reinforcement learning (RL). To do so, we propose a novel way of approximating the optimal value function for goal-conditioned offline RL problems under sparse rewards and symmetric, deterministic actions. We study a property of representations that allows optimality to be recovered and propose a new optimization objective that leads to such a property. We use the learned value function to guide the learning of a policy in an actor-critic fashion, a method we name MetricRL. Experimentally, we show that our method consistently outperforms other offline RL baselines when learning from sub-optimal offline datasets. Moreover, we show that it is effective in handling high-dimensional observations and multi-goal tasks.