We present Q-chunking, a simple yet effective recipe for improving reinforcement learning (RL) algorithms on long-horizon, sparse-reward tasks. Our recipe is designed for the offline-to-online RL setting, where the goal is to leverage an offline prior dataset to maximize the sample efficiency of online learning. Effective exploration and sample-efficient learning remain central challenges in this setting, as it is not obvious how the offline data should be utilized to acquire a good exploratory policy. Our key insight is that action chunking, a technique popularized in imitation learning in which sequences of future actions are predicted rather than a single action at each timestep, can be applied to temporal difference (TD)-based RL methods to mitigate the exploration challenge. Q-chunking adopts action chunking by directly running RL in a 'chunked' action space, enabling the agent to (1) leverage temporally consistent behaviors from offline data for more effective online exploration and (2) use unbiased $n$-step backups for more stable and efficient TD learning. Our experimental results demonstrate that Q-chunking exhibits strong offline performance and online sample efficiency, outperforming prior best offline-to-online methods on a range of long-horizon, sparse-reward manipulation tasks.
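To make point (2) concrete, here is a minimal sketch of the chunked TD backup suggested above; the notation is ours rather than the paper's (chunk length $h$, action chunk $a_{t:t+h-1} = (a_t, \ldots, a_{t+h-1})$, discount factor $\gamma$):
$$
Q\big(s_t, a_{t:t+h-1}\big) \;\leftarrow\; \sum_{i=0}^{h-1} \gamma^{i}\, r_{t+i} \;+\; \gamma^{h}\, Q\big(s_{t+h}, a'_{t+h:t+2h-1}\big), \qquad a'_{t+h:t+2h-1} \sim \pi\big(\cdot \mid s_{t+h}\big).
$$
Because the policy commits to the entire chunk $a_{t:t+h-1}$ at once, all $h$ rewards inside the chunk are generated on-policy with respect to that single chunked decision, so this $h$-step target needs no importance-sampling correction, which is what makes the backup unbiased.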