Reinforcement Learning with Action Chunking

We present Q-chunking, a simple yet effective recipe for improving reinforcement learning (RL) algorithms for long-horizon, sparse-reward tasks. Our recipe is designed for the offline-to-online RL setting, where the goal is to leverage an offline prior dataset to maximize the sample-efficiency of online learning. Effective exploration and sample-efficient learning remain central challenges in this setting, as it is not obvious how the offline data should be utilized to acquire a good exploratory policy. Our key insight is that action chunking, a technique popularized in imitation learning where sequences of future actions are predicted rather than a single action at each timestep, can be applied to temporal difference (TD)-based RL methods to mitigate the exploration challenge. Q-chunking adopts action chunking by directly running RL in a 'chunked' action space, enabling the agent to (1) leverage temporally consistent behaviors from offline data for more effective online exploration and (2) use unbiased $n$-step backups for more stable and efficient TD learning. Our experimental results demonstrate that Q-chunking exhibits strong offline performance and online sample efficiency, outperforming prior best offline-to-online methods on a range of long-horizon, sparse-reward manipulation tasks.
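As an illustration of the two ingredients named above, the following is a minimal sketch, not the authors' implementation. It assumes a chunk length `h`, a discount `gamma`, and a scalar `bootstrap_q` standing in for a critic's estimate of the value of the next chunk; the function names `to_chunks` and `chunked_td_target` are hypothetical. Because the critic would be conditioned on the entire chunk, the `h` rewards summed in the target are those produced by the very actions being evaluated, which is the sense in which the $n$-step backup needs no off-policy correction.

```python
import numpy as np


def to_chunks(actions, h):
    """Group a (T, action_dim) action sequence into non-overlapping chunks.

    Returns an array of shape (T // h, h * action_dim); trailing actions
    that do not fill a whole chunk are dropped. (Hypothetical helper.)
    """
    actions = np.asarray(actions)
    T, d = actions.shape
    n = T // h
    return actions[: n * h].reshape(n, h * d)


def chunked_td_target(rewards, bootstrap_q, gamma=0.99):
    """h-step TD target for a single action chunk.

    rewards:     shape (h,), the rewards r_t ... r_{t+h-1} observed while
                 executing the chunk a_t ... a_{t+h-1}.
    bootstrap_q: assumed scalar estimate of Q(s_{t+h}, next chunk).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    h = rewards.shape[0]
    discounts = gamma ** np.arange(h)  # 1, gamma, ..., gamma^(h-1)
    return float(discounts @ rewards + gamma ** h * bootstrap_q)


# Usage: a 6-step trajectory with 2-D actions, grouped into chunks of 3.
chunks = to_chunks(np.arange(12).reshape(6, 2), h=3)
target = chunked_td_target([0.0, 0.0, 1.0], bootstrap_q=2.0, gamma=0.5)
```

In a full agent, the policy would output one flattened chunk per decision point and execute it open-loop for `h` environment steps before querying the policy again.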
