Existing reinforcement learning (RL) methods struggle with long-horizon robotic manipulation tasks, particularly those involving sparse rewards. While action chunking is a promising paradigm for robotic manipulation, using RL to directly learn continuous action chunks in a stable and data-efficient manner remains a critical challenge. This paper introduces AC3 (Actor-Critic for Continuous Chunks), a novel RL framework that learns to generate high-dimensional, continuous action sequences. To make this learning process stable and data-efficient, AC3 incorporates targeted stabilization mechanisms for both the actor and the critic. First, to ensure reliable policy improvement, the actor is trained with an asymmetric update rule that learns exclusively from successful trajectories. Second, to enable effective value learning under sparse rewards, the critic's update is stabilized with intra-chunk $n$-step returns and further enriched by a self-supervised module that provides intrinsic rewards at anchor points aligned with each action chunk. We conduct extensive experiments on 25 tasks from the BiGym and RLBench benchmarks. The results show that, using only a few demonstrations and a simple model architecture, AC3 achieves superior success rates on most tasks, validating the effectiveness of its design.
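For intuition, a minimal sketch of what an intra-chunk $n$-step critic target could look like, assuming chunk length $n$, discount factor $\gamma$, per-step rewards $r_{t+i}$, a target critic $Q_{\bar{\theta}}$, and policy $\pi$ (notation assumed here for illustration, not taken from the abstract):
$$
y_t = \sum_{i=0}^{n-1} \gamma^{i}\, r_{t+i} + \gamma^{n}\, Q_{\bar{\theta}}\bigl(s_{t+n}, \pi(s_{t+n})\bigr),
$$
i.e., the rewards collected over the $n$ environment steps executed within a single action chunk are accumulated before bootstrapping from the critic's value estimate at the chunk boundary.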