Offline goal-conditioned reinforcement learning (GCRL) can be challenging because agents overfit to the given dataset. To generalize agents' skills beyond that dataset, we propose a goal-swapping procedure that generates additional trajectories. To alleviate the problem of noise and extrapolation errors, we present a general offline reinforcement learning method called deterministic Q-advantage policy gradient (DQAPG). In experiments, DQAPG outperforms state-of-the-art goal-conditioned offline RL methods on a wide range of benchmark tasks, and goal-swapping further improves the test results. Notably, the proposed method performs well on challenging dexterous in-hand manipulation tasks on which prior methods failed.
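The abstract does not spell out how goal swapping is implemented, so the following is only a minimal sketch of one plausible reading: each stored trajectory is paired with the goal of a different trajectory, and rewards are recomputed for the new goal. The dictionary layout (`achieved`, `goal`, `rewards`), the function name `swap_goals`, the sparse distance-based reward, and the tolerance `tol` are all assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def swap_goals(trajectories, rng, tol=0.05):
    """Return copies of the trajectories, each relabeled with another one's goal.

    Hypothetical data layout (an assumption, not from the paper):
      traj["achieved"]: list of achieved-goal tuples, one per time step
      traj["goal"]:     the goal tuple the trajectory was collected for
    """
    n = len(trajectories)
    if n < 2:
        raise ValueError("need at least two trajectories to swap goals")

    augmented = []
    for i, traj in enumerate(trajectories):
        # Draw an index j != i uniformly at random.
        j = rng.randrange(n - 1)
        if j >= i:
            j += 1
        new_goal = trajectories[j]["goal"]

        # Assumed sparse reward: 1.0 when the achieved goal is within tol
        # of the swapped-in goal, otherwise 0.0.
        rewards = [
            1.0 if math.dist(achieved, new_goal) <= tol else 0.0
            for achieved in traj["achieved"]
        ]
        augmented.append({**traj, "goal": new_goal, "rewards": rewards})
    return augmented


# Two toy 2-D trajectories; the second one passes through the first one's goal.
demo = [
    {"achieved": [(0.0, 0.0), (1.0, 0.0)], "goal": (1.0, 0.0)},
    {"achieved": [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)], "goal": (0.0, 1.0)},
]
swapped = swap_goals(demo, random.Random(0))
```

In this toy setup the second trajectory happens to reach the first trajectory's goal, so the relabeled copy carries a useful positive reward. Whether the actual procedure filters swaps by reachability or selects goals differently is not stated in the abstract.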