Despite recent progress in reinforcement learning (RL) from raw pixel data, sample inefficiency remains a substantial obstacle. Prior work has attempted to address this challenge by designing self-supervised auxiliary tasks that aim to enrich the agent's learned representations with control-relevant information for future state prediction. However, these objectives are often insufficient to learn representations capable of expressing the optimal policy or value function. They also typically consider tasks with small, abstract discrete action spaces and thus overlook the importance of action representation learning in continuous control. In this paper, we introduce TACO: Temporal Action-driven Contrastive Learning, a simple yet powerful temporal contrastive learning approach that enables agents to acquire latent state and action representations concurrently. TACO learns a state representation and an action representation simultaneously by optimizing the mutual information between representations of current states paired with action sequences and representations of the corresponding future states. Theoretically, TACO can be shown to learn state and action representations that contain sufficient information for control, thereby improving sample efficiency. For online RL, TACO achieves a 40% performance boost after one million environment interaction steps, on average, across nine challenging visual continuous control tasks from the DeepMind Control Suite. In addition, we show that TACO can serve as a plug-and-play module added to existing offline visual RL methods, establishing new state-of-the-art performance for offline visual RL across offline datasets of varying quality.
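As an illustration of the temporal contrastive objective described above, the following is a minimal sketch of an InfoNCE-style loss, a standard contrastive objective whose minimization maximizes a lower bound on mutual information. It is not the paper's implementation. The function name `info_nce_loss`, the plain dot-product similarity, and the use of in-batch negatives are assumptions made for illustration. The encoders that would produce the embeddings are omitted, and plain Python lists stand in for an array library to keep the example self-contained.

```python
import math


def info_nce_loss(queries, keys):
    """InfoNCE-style contrastive loss over a batch (illustrative sketch).

    queries[i]: embedding of a current state paired with an action sequence.
    keys[i]:    embedding of the corresponding future state (the positive).
    All other keys in the batch are treated as negatives for queries[i].
    The dot-product similarity is an assumption, not the paper's choice.
    """
    n = len(queries)
    total = 0.0
    for i in range(n):
        # Similarity of query i to every key in the batch.
        logits = [
            sum(q * k for q, k in zip(queries[i], keys[j]))
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching key as the correct "class".
        total += log_z - logits[i]
    return total / n


# Hypothetical toy embeddings: two well-separated pairs.
queries = [[5.0, 0.0], [0.0, 5.0]]
aligned_keys = [[5.0, 0.0], [0.0, 5.0]]
shuffled_keys = [[0.0, 5.0], [5.0, 0.0]]

loss_aligned = info_nce_loss(queries, aligned_keys)
loss_shuffled = info_nce_loss(queries, shuffled_keys)
```

In this toy setup the loss is close to zero when each query matches its own key and large when the pairing is shuffled, which is the behavior a contrastive objective relies on to pull matching (state, action sequence) and future-state representations together.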