Training reinforcement learning (RL) agents on robotic tasks typically requires a large number of training samples. This is because training data often consists of noisy trajectories, whether from exploration or human-collected demonstrations, making it difficult to learn value functions that capture the effect of each individual action. On the other hand, recent behavior-cloning (BC) approaches have shown that predicting a sequence of actions enables policies to effectively approximate the noisy, multi-modal distributions of expert demonstrations. Can we use a similar idea to improve RL on robotic tasks? In this paper, we introduce a novel RL algorithm that learns a critic network outputting Q-values over a sequence of actions. By explicitly training the value functions to learn the consequence of executing a series of current and future actions, our algorithm enables learning useful value functions from noisy trajectories. We study our algorithm across various setups with sparse and dense rewards, and with or without demonstrations, spanning mobile bi-manual manipulation, whole-body control, and tabletop manipulation tasks from BiGym, HumanoidBench, and RLBench. We find that, by training the critic network on action sequences, our algorithm outperforms various RL and BC baselines, in particular on challenging humanoid control tasks.
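The core idea, a critic that scores a chunk of k consecutive actions rather than a single action, can be sketched as follows. This is a minimal illustration, not the paper's actual architecture: the network shape, dimensions, and randomly initialized weights are all assumptions, standing in for whatever learned parameters the method would produce.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (assumptions, not from the paper).
STATE_DIM, ACTION_DIM, SEQ_LEN, HIDDEN = 8, 3, 4, 32

# Randomly initialized weights stand in for learned critic parameters.
W1 = rng.normal(0.0, 0.1, (STATE_DIM + ACTION_DIM * SEQ_LEN, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(0.0, 0.1, (HIDDEN, 1))
b2 = np.zeros(1)

def q_value(state, action_seq):
    """Q(s, a_t, ..., a_{t+k-1}): score the whole action chunk at once,
    instead of a single-step Q(s, a_t)."""
    # Flatten the k-step action sequence and concatenate with the state.
    x = np.concatenate([state, np.asarray(action_seq).reshape(-1)])
    h = np.maximum(0.0, x @ W1 + b1)  # ReLU hidden layer
    return float((h @ W2 + b2)[0])

state = rng.normal(size=STATE_DIM)
chunk = rng.normal(size=(SEQ_LEN, ACTION_DIM))  # k current + future actions
q = q_value(state, chunk)
print(np.isfinite(q))
```

A single-action critic is recovered as the special case `SEQ_LEN = 1`; the only structural change for sequence-level value learning is that the critic's input includes the flattened future-action chunk.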