Predicting a sequence of actions has been crucial to the success of recent behavior cloning algorithms in robotics. Can similar ideas improve reinforcement learning (RL)? We answer affirmatively by observing that incorporating action sequences when predicting the ground-truth return-to-go leads to lower validation loss. Motivated by this, we introduce Coarse-to-fine Q-Network with Action Sequence (CQN-AS), a novel value-based RL algorithm that learns a critic network that outputs Q-values over a sequence of actions, i.e., explicitly training the value function to learn the consequences of executing action sequences. Our experiments show that CQN-AS outperforms several baselines on a variety of sparse-reward humanoid control and tabletop manipulation tasks from BiGym and RLBench.
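To make the core idea concrete, the following is a minimal sketch (not the authors' implementation) of a critic that outputs Q-values over an action sequence rather than a single action. It assumes discretized actions with a hypothetical `num_bins` bins per action dimension per step, and it omits the coarse-to-fine level structure of CQN-AS; all names (`ActionSequenceCritic`, `seq_len`, `num_bins`) are illustrative.

```python
# Hypothetical sketch: a Q-network that scores an action *sequence*,
# producing one Q-value per (sequence step, action dimension, bin).
import torch
import torch.nn as nn


class ActionSequenceCritic(nn.Module):
    def __init__(self, obs_dim: int, action_dim: int, seq_len: int,
                 num_bins: int, hidden_dim: int = 256):
        super().__init__()
        self.seq_len = seq_len
        self.action_dim = action_dim
        self.num_bins = num_bins
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            # One output per (step, action dimension, bin).
            nn.Linear(hidden_dim, seq_len * action_dim * num_bins),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: (batch, obs_dim) -> Q-values: (batch, seq_len, action_dim, num_bins)
        q = self.net(obs)
        return q.view(-1, self.seq_len, self.action_dim, self.num_bins)


# Greedy action-sequence selection: argmax bin for every step and dimension.
critic = ActionSequenceCritic(obs_dim=39, action_dim=7, seq_len=4, num_bins=5)
obs = torch.randn(2, 39)
greedy_bins = critic(obs).argmax(dim=-1)  # (2, 4, 7) discrete action indices
```

A single-action critic would collapse the `seq_len` axis to 1; keeping it lets the value function be trained directly on the consequences of executing the whole sequence.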