Temporal Difference (TD) algorithms are widely used in Deep Reinforcement Learning (RL). Their performance is heavily influenced by the size of the neural network. While in supervised learning the regime of over-parameterization and its benefits are well understood, the situation in RL is much less clear. In this paper, we present a theoretical analysis of the influence of network size and $l_2$-regularization on performance. We identify the ratio between the number of parameters and the number of visited states as a crucial factor and define over-parameterization as the regime in which this ratio is larger than one. Furthermore, we observe a double descent phenomenon, i.e., a sudden drop in performance around a parameter/state ratio of one. Leveraging random features and the lazy training regime, we study the regularized Least-Square Temporal Difference (LSTD) algorithm in an asymptotic regime, as both the number of parameters and states go to infinity while maintaining a constant ratio. We derive deterministic limits of both the empirical and the true Mean-Squared Bellman Error (MSBE) that feature correction terms responsible for the double descent. These correction terms vanish when the $l_2$-regularization is increased or the number of unvisited states goes to zero. Numerical experiments with synthetic and small real-world environments closely match the theoretical predictions.
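As a rough illustration of the object the abstract analyzes, the sketch below computes a regularized LSTD solution with random features on synthetic data and evaluates the empirical MSBE. Everything here is a hypothetical toy setup (the MDP data, the `tanh` random-feature map, and the variable names are assumptions for illustration, not the paper's experimental setting); the parameter/state ratio `N / n` is the quantity the abstract identifies as crucial.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: n visited states of dimension d, with successor
# states and rewards drawn at random. N random features give a
# parameter/state ratio N / n = 2 (over-parameterized regime).
n, N, d = 50, 100, 10
gamma, lam = 0.95, 1e-3          # discount factor, l2-regularization

S = rng.normal(size=(n, d))      # visited states
S_next = rng.normal(size=(n, d)) # successor states
r = rng.normal(size=n)           # rewards

# Fixed random first-layer weights (random features / lazy training:
# only the linear readout theta is fit).
W = rng.normal(size=(d, N)) / np.sqrt(d)
phi = np.tanh(S @ W)
phi_next = np.tanh(S_next @ W)

# One common form of regularized LSTD:
#   theta = (Phi^T (Phi - gamma * Phi') + lam * I)^{-1} Phi^T r
A = phi.T @ (phi - gamma * phi_next) + lam * np.eye(N)
theta = np.linalg.solve(A, phi.T @ r)

# Empirical Mean-Squared Bellman Error on the visited states
bellman_residual = r + gamma * (phi_next @ theta) - phi @ theta
msbe = float(np.mean(bellman_residual**2))
print(f"empirical MSBE: {msbe:.6f}")
```

Sweeping `N` across the value of `n` (while averaging over random draws of `W`) is the kind of experiment in which the double descent around a ratio of one becomes visible, and increasing `lam` is expected to smooth the peak out.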