In reinforcement learning (RL), we train a value function to understand the long-term consequence of executing a single action. However, in robotics the value of each action can be ambiguous, because robot movements are typically the aggregate result of executing multiple small actions. Moreover, robotic training data often consists of noisy trajectories, in which individual actions are noisy but a series of actions nonetheless produces a meaningful robot movement. This further makes it difficult for the value function to understand the effect of individual actions. To address this, we introduce Coarse-to-fine Q-Network with Action Sequence (CQN-AS), a novel value-based RL algorithm that learns a critic network that outputs Q-values over a sequence of actions, i.e., explicitly training the value function to learn the consequence of executing action sequences. We study our algorithm on 53 robotic tasks with sparse and dense rewards, as well as with and without demonstrations, from BiGym, HumanoidBench, and RLBench. We find that CQN-AS outperforms various baselines, in particular on humanoid control tasks.
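To make the core idea concrete, the following is a minimal sketch (not the authors' implementation) of a critic that scores a whole action sequence: instead of one Q-value per action, it outputs a Q-value table with one row per step in the sequence and one column per discretized action bin. The sizes and the linear "network" are hypothetical stand-ins chosen for illustration.

```python
import numpy as np

# Hypothetical sizes: a 4-step action sequence, 5 discrete bins per step.
STATE_DIM, SEQ_LEN, NUM_BINS = 8, 4, 5

rng = np.random.default_rng(0)
# A random linear map stands in for a learned critic network.
W = rng.normal(size=(STATE_DIM, SEQ_LEN * NUM_BINS))

def critic(state):
    """Return Q-values with shape [SEQ_LEN, NUM_BINS]: one row per
    action in the sequence, one column per discretized action bin."""
    return (state @ W).reshape(SEQ_LEN, NUM_BINS)

def act(state):
    """Greedy action sequence: argmax over bins at every step,
    so the whole multi-step sequence is chosen at once."""
    q = critic(state)
    return q.argmax(axis=1)

state = rng.normal(size=STATE_DIM)
seq = act(state)
print(seq.shape)  # a 4-step action sequence selected in one forward pass
```

The point of the sketch is the output shape: because the critic evaluates the entire sequence jointly, its training signal reflects the consequence of a multi-step movement rather than a single noisy action.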