Temporal Difference (TD) algorithms are widely used in Deep Reinforcement Learning (RL), and their performance is heavily influenced by the size of the neural network. While the over-parameterized regime and its benefits are well understood in supervised learning, the situation in RL is much less clear. In this paper, we present a theoretical analysis of the influence of network size and $l_2$-regularization on performance. We identify the ratio between the number of parameters and the number of visited states as a crucial factor and define over-parameterization as the regime in which this ratio is larger than one. Furthermore, we observe a double descent phenomenon, i.e., a sudden drop in performance around a parameter/state ratio of one. Leveraging random features and the lazy training regime, we study the regularized Least-Squares Temporal Difference (LSTD) algorithm in an asymptotic regime in which both the number of parameters and the number of states go to infinity while their ratio remains constant. We derive deterministic limits of both the empirical and the true Mean-Squared Bellman Error (MSBE) that feature correction terms responsible for the double descent. These correction terms vanish when the $l_2$-regularization is increased or the number of unvisited states goes to zero. Numerical experiments with synthetic and small real-world environments closely match the theoretical predictions.
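As a rough illustration of the setting described in the abstract, the sketch below fits regularized LSTD on fixed ReLU random features and reports the empirical MSBE for several parameter counts. It is a minimal sketch under stated assumptions, not the paper's experimental code: the synthetic Markov reward process, the uniform state sampling, the specific normal equations $(\Phi^\top(\Phi - \gamma\Phi')/n + \lambda I)\,\theta = \Phi^\top r/n$, and all names (`random_features`, `regularized_lstd`, `empirical_msbe`, `run_sweep`) are hypothetical choices made for illustration. The parameter/state ratio is computed here as the number of features divided by the number of distinct states appearing in the sampled transitions.

```python
import numpy as np


def random_features(states, W):
    """ReLU random features: one row max(W s, 0) per state; W is fixed, not trained."""
    return np.maximum(states @ W.T, 0.0)


def regularized_lstd(phi, phi_next, rewards, gamma, lam):
    """Solve (Phi^T (Phi - gamma Phi') / n + lam I) theta = Phi^T r / n (assumed form)."""
    n, num_params = phi.shape
    A = phi.T @ (phi - gamma * phi_next) / n + lam * np.eye(num_params)
    b = phi.T @ rewards / n
    return np.linalg.solve(A, b)


def empirical_msbe(phi, phi_next, rewards, gamma, theta):
    """Mean squared Bellman residual over the collected transitions."""
    residual = rewards + gamma * phi_next @ theta - phi @ theta
    return float(np.mean(residual ** 2))


def run_sweep(param_counts, num_states=50, num_samples=200, dim=8,
              gamma=0.9, lam=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    # Hypothetical synthetic Markov reward process, for illustration only.
    state_vecs = rng.normal(size=(num_states, dim)) / np.sqrt(dim)
    P = rng.dirichlet(np.ones(num_states), size=num_states)
    reward_per_state = rng.normal(size=num_states)

    # Sample transitions (s_i, r_i, s'_i); start states are drawn uniformly (assumption).
    s = rng.integers(num_states, size=num_samples)
    s_next = np.array([rng.choice(num_states, p=P[i]) for i in s])
    rewards = reward_per_state[s]
    num_visited = len(np.unique(np.concatenate([s, s_next])))

    results = {}
    for N in param_counts:
        W = rng.normal(size=(N, dim))  # fresh random feature weights for each size
        phi = random_features(state_vecs[s], W)
        phi_next = random_features(state_vecs[s_next], W)
        theta = regularized_lstd(phi, phi_next, rewards, gamma, lam)
        msbe = empirical_msbe(phi, phi_next, rewards, gamma, theta)
        results[N] = (N / num_visited, msbe)
    return results


if __name__ == "__main__":
    for N, (ratio, msbe) in run_sweep([10, 25, 50, 100, 200]).items():
        print(f"N={N:4d}  params/visited-states={ratio:5.2f}  empirical MSBE={msbe:.4f}")
```

Sweeping `param_counts` across values below and above the number of visited states, and repeating with a larger `lam`, is one way to probe how the error behaves near a ratio of one; the sketch only measures the empirical MSBE on sampled transitions, not the true MSBE over all states.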