We propose a novel value approximation method, Eigensubspace Regularized Critic (ERC), for deep reinforcement learning (RL). ERC is motivated by an analysis of the dynamics of the Q-value approximation error in the Temporal-Difference (TD) method: the error follows a path defined by the 1-eigensubspace of the transition kernel associated with the Markov Decision Process (MDP). This analysis reveals a fundamental property of TD learning that previous deep RL approaches have left unused. ERC introduces a regularizer that guides the approximation error towards the 1-eigensubspace, resulting in a more efficient and stable path of value approximation. We prove theoretically that ERC converges, and both theoretical analysis and experiments show that ERC effectively reduces the variance of value functions. On the DMControl benchmark, ERC outperforms state-of-the-art methods on 20 of 26 tasks and shows significant advantages in Q-value approximation and variance reduction. Our code is available at https://sites.google.com/view/erc-ecml23/.
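To make the 1-eigensubspace concrete, the following is a minimal illustrative sketch and not the released ERC implementation. It assumes a small tabular setting with a strictly positive, row-stochastic transition matrix `P`, for which the all-ones vector is an eigenvector with eigenvalue 1. The helper names `random_transition_matrix` and `distance_to_one_eigensubspace` are hypothetical, and the penalty shown (mean squared distance of an error vector from its projection onto the span of the all-ones vector) is only one plausible reading of "guiding the error towards the 1-eigensubspace".

```python
import numpy as np


def random_transition_matrix(n, seed=0):
    """Hypothetical helper: a strictly positive row-stochastic matrix."""
    rng = np.random.default_rng(seed)
    P = rng.random((n, n)) + 1e-3          # strictly positive entries
    return P / P.sum(axis=1, keepdims=True)  # each row sums to 1


def distance_to_one_eigensubspace(err):
    """Hypothetical penalty: mean squared distance of `err` from its
    projection onto span{(1, ..., 1)}. This equals the (population)
    variance of the entries of `err`."""
    err = np.asarray(err, dtype=float)
    projection = np.full_like(err, err.mean())  # closest constant vector
    return float(np.mean((err - projection) ** 2))


P = random_transition_matrix(5)
ones = np.ones(5)

# Rows of P sum to 1, so P @ ones == ones: the all-ones vector lies in
# the 1-eigensubspace of P.
fixed_by_P = np.allclose(P @ ones, ones)

# A constant error vector already lies in that subspace (zero penalty);
# an uneven error vector does not (positive penalty).
constant_error = np.full(5, 0.7)
uneven_error = np.array([0.1, -0.4, 0.9, 0.0, 0.3])
penalty_constant = distance_to_one_eigensubspace(constant_error)
penalty_uneven = distance_to_one_eigensubspace(uneven_error)
```

Under these assumptions the penalty is exactly the variance of the error entries, which is at least consistent with the variance-reduction claim in the abstract; the regularizer actually used by ERC should be taken from the paper and the linked code.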