We propose a novel value approximation method for deep reinforcement learning (RL), the Eigensubspace Regularized Critic (ERC). ERC is motivated by an analysis of the dynamics of the Q-value approximation error in the Temporal-Difference (TD) method: the error follows a path defined by the 1-eigensubspace of the transition kernel associated with the Markov Decision Process (MDP). This analysis reveals a fundamental property of TD learning that previous deep RL approaches have left unused. ERC introduces a regularizer that guides the approximation error toward the 1-eigensubspace, resulting in a more efficient and stable path of value approximation. We prove theoretically that ERC converges, and both theoretical analysis and experiments demonstrate that ERC effectively reduces the variance of value functions. On 20 of the 26 tasks in the DMControl benchmark, ERC outperforms state-of-the-art methods, and it shows significant advantages in Q-value approximation and variance reduction. Our code is available at https://sites.google.com/view/erc-ecml23/.
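As a rough illustration of what the 1-eigensubspace means here, the sketch below uses a toy row-stochastic matrix in place of the transition kernel. It relies on one certain fact: the rows of such a matrix sum to 1, so the all-ones vector is an eigenvector with eigenvalue 1, and for a strictly positive matrix the 1-eigensubspace is exactly the constant vectors. Everything else is an assumption made for illustration: the names `random_transition_matrix` and `off_subspace_penalty` are hypothetical, modelling error propagation as repeated multiplication by `P` is a simplification, and the penalty shown is not the regularizer defined in the ERC paper.

```python
import numpy as np


def random_transition_matrix(n, seed=0):
    """Toy row-stochastic matrix with strictly positive entries (hypothetical kernel)."""
    rng = np.random.default_rng(seed)
    P = rng.random((n, n)) + 0.1          # keep every entry bounded away from zero
    return P / P.sum(axis=1, keepdims=True)


def off_subspace_penalty(error):
    """Mean squared size of the part of `error` orthogonal to the all-ones vector.

    Subtracting the mean projects out the constant component, so this equals
    the population variance of the entries. Illustrative only: this is an
    assumed stand-in, not the ERC regularizer.
    """
    centered = error - error.mean()
    return float(np.mean(centered ** 2))


n = 5
P = random_transition_matrix(n)
ones = np.ones(n)

# Rows sum to 1, so P @ ones == ones: the all-ones vector has eigenvalue 1.
assert np.allclose(P @ ones, ones)

# Simplified error dynamics (an assumption): e_{k+1} = P @ e_k.
rng = np.random.default_rng(1)
e = rng.normal(size=n)
penalties = []
for _ in range(50):
    penalties.append(off_subspace_penalty(e))
    e = P @ e

print("penalty at start:", penalties[0])
print("penalty at end:  ", penalties[-1])
```

In this toy setting the component of the error outside the 1-eigensubspace shrinks under repeated application of `P`, which leaves an error that is nearly constant across entries. A penalty of this kind, added to a critic loss with a small weight, would push the error in the same direction; how ERC actually constructs and weights its regularizer is specified in the paper and the released code, not here.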