Deep reinforcement learning (DRL) promises that an agent can learn a good policy from high-dimensional information, while representation learning removes irrelevant and redundant information and retains pertinent information. In this work, we demonstrate that the learned representations of the $Q$-network and its target $Q$-network should, in theory, satisfy a favorable distinguishable representation property. Specifically, there exists an upper bound on the representation similarity of the value functions of two adjacent time steps in a typical DRL setting. However, through illustrative experiments, we show that a learned DRL agent may violate this property, leading to a sub-optimal policy. Therefore, we propose a simple yet effective regularizer called Policy Evaluation with Easy Regularization on Representation (PEER), which aims to maintain the distinguishable representation property via explicit regularization on internal representations. We also provide a convergence rate guarantee for PEER. Implementing PEER requires only one line of code. Our experiments demonstrate that incorporating PEER into DRL can significantly improve performance and sample efficiency. Comprehensive experiments show that PEER achieves state-of-the-art performance on all 4 environments in PyBullet, 9 out of 12 tasks in DMControl, and 19 out of 26 games in Atari. To the best of our knowledge, PEER is the first work to study the inherent representation property of the $Q$-network and its target. Our code is available at https://sites.google.com/view/peer-cvpr2023/.
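To make the idea of an explicit regularizer on internal representations concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not state the exact form of the penalty, so this sketch assumes that representation similarity is measured by the inner product between the $Q$-network's feature vector at one time step and the target $Q$-network's feature vector at the next time step. The function names, the weight `beta`, and the toy feature vectors are all hypothetical.

```python
from typing import Sequence

Vector = Sequence[float]


def inner_product(u: Vector, v: Vector) -> float:
    """Dot product of two representation vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("representation vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def representation_penalty(
    online_feats: Sequence[Vector],
    target_next_feats: Sequence[Vector],
) -> float:
    """Batch mean of the similarity between adjacent-step representations.

    Assumption: similarity is the inner product between the online
    network's features at step t and the target network's features at
    step t+1. The abstract does not specify this form.
    """
    if not online_feats or len(online_feats) != len(target_next_feats):
        raise ValueError("batches must be non-empty and of equal size")
    total = sum(
        inner_product(u, v) for u, v in zip(online_feats, target_next_feats)
    )
    return total / len(online_feats)


def regularized_loss(
    td_loss: float,
    online_feats: Sequence[Vector],
    target_next_feats: Sequence[Vector],
    beta: float,
) -> float:
    """Policy-evaluation loss plus the weighted representation penalty.

    The added term is the kind of single extra line the abstract alludes
    to; `beta` is a hypothetical regularization weight.
    """
    return td_loss + beta * representation_penalty(online_feats, target_next_feats)


# Toy batch of two transitions with 2-dimensional features (made-up values).
online = [[1.0, 0.0], [0.5, 0.5]]
target_next = [[0.0, 1.0], [1.0, 1.0]]
penalty = representation_penalty(online, target_next)
loss = regularized_loss(td_loss=1.0, online_feats=online,
                        target_next_feats=target_next, beta=0.1)
```

In a real agent the feature vectors would be the penultimate-layer activations of the two networks, and the penalty would be computed on tensors so that gradients flow only through the online network. Minimizing the penalty pushes the two representations apart, which is one way to read "maintaining the distinguishable representation property".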