Training autonomous agents with sparse rewards is a long-standing problem in online reinforcement learning (RL) because of low data efficiency. Prior work addresses this challenge by extracting useful knowledge from offline data, typically by learning an action distribution from the offline data and using the learned distribution to facilitate online RL. However, because the offline data are given and fixed, the extracted knowledge is inherently limited, making it difficult to generalize to new tasks. We propose a novel approach that leverages offline data to learn a generative diffusion model, which we call Adaptive Trajectory Diffuser (ATraDiff). This model generates synthetic trajectories that serve as a form of data augmentation and thereby improve the performance of online RL methods. The key strength of our diffuser lies in its adaptability, which allows it to handle varying trajectory lengths effectively and to mitigate distribution shifts between online and offline data. Because of its simplicity, ATraDiff integrates seamlessly with a wide spectrum of RL methods. Empirical evaluation shows that ATraDiff consistently achieves state-of-the-art performance across a variety of environments, with particularly pronounced improvements in complicated settings. Our code and demo video are available at https://atradiff.github.io.