In goal-conditioned reinforcement learning (GCRL), sparse rewards make efficient learning difficult. Multi-step GCRL can improve learning efficiency, but it also introduces off-policy bias into the target values. This paper analyzes these biases and classifies them into two types: "shooting" and "shifting". Recognizing that certain behavior policies can hasten policy refinement, we present solutions that exploit the beneficial aspects of these biases while limiting their drawbacks, which enables larger step sizes to speed up GCRL. An empirical study shows that our approach yields a robust improvement even in ten-step learning scenarios, with learning efficiency and performance that generally surpass the baseline and several state-of-the-art multi-step GCRL methods.
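To make the multi-step target value concrete, the following is a minimal sketch of a generic n-step bootstrapped target under a sparse goal-reaching reward. It is not the method proposed in this paper: the function names, the scalar state, the tolerance `tol`, the discount `gamma`, and the -1/0 reward convention are all assumptions chosen for illustration.

```python
def sparse_reward(achieved, goal, tol=0.05):
    """Hypothetical sparse reward: 0 when the goal is reached, -1 otherwise."""
    return 0.0 if abs(achieved - goal) <= tol else -1.0


def n_step_target(rewards, bootstrap_value, gamma=0.98):
    """Generic n-step target: sum_{i<n} gamma^i * r_i + gamma^n * bootstrap_value.

    `rewards` are the n rewards stored along a replayed trajectory segment and
    `bootstrap_value` is the value estimate at the state reached after n steps.
    """
    target = bootstrap_value
    # Fold the rewards in from the last step back to the first.
    for r in reversed(rewards):
        target = r + gamma * target
    return target


# Rewards for three replayed steps toward goal 1.0 (assumed scalar states).
achieved_states = [0.2, 0.6, 0.98]
rewards = [sparse_reward(s, goal=1.0) for s in achieved_states]
target = n_step_target(rewards, bootstrap_value=0.0, gamma=0.98)
```

Because the `rewards` come from actions chosen by the behavior policy that filled the replay buffer, rather than by the current policy, a target built this way can be biased, and the effect grows with the number of steps n.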