Temporal Difference (TD) algorithms are widely used in Deep Reinforcement Learning (RL), and their performance is heavily influenced by the size of the neural network. While the over-parameterized regime and its benefits are well understood in supervised learning, the situation in RL is much less clear. In this paper, we present a theoretical analysis of the influence of network size and $l_2$-regularization on performance. We identify the ratio between the number of parameters and the number of visited states as a crucial factor and define over-parameterization as the regime in which this ratio is larger than one. Furthermore, we observe a double descent phenomenon, i.e., a sudden drop in performance around a parameter/state ratio of one. Leveraging random features and the lazy training regime, we study the regularized Least-Squares Temporal Difference (LSTD) algorithm in an asymptotic regime in which the number of parameters and the number of states both go to infinity while their ratio remains constant. We derive deterministic limits of both the empirical and the true Mean-Squared Bellman Error (MSBE) that feature correction terms responsible for the double descent. These correction terms vanish when the $l_2$-regularization is increased or the number of unvisited states goes to zero. Numerical experiments with synthetic and small real-world environments closely match the theoretical predictions.
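To make the objects in the abstract concrete, the following is a minimal sketch of ridge-regularized LSTD on fixed random ReLU features, together with the empirical MSBE on the sampled transitions. It is an illustration under stated assumptions, not the paper's implementation: the synthetic data, the function names, the feature scaling, and the `lam * n` normalization of the regularizer are all hypothetical choices, and the number of sampled transitions is used as a stand-in for the number of visited states. The loop only shows how the parameter/state ratio enters the computation; it is not meant to reproduce the reported double descent curves.

```python
import numpy as np


def random_features(states, weights):
    """ReLU random features phi(s) = max(W s, 0); W is drawn once and never trained."""
    return np.maximum(states @ weights.T, 0.0)


def regularized_lstd(phi, phi_next, rewards, gamma, lam):
    """Solve (Phi^T (Phi - gamma * Phi') + lam * n * I) theta = Phi^T r for theta."""
    n, num_features = phi.shape
    a_matrix = phi.T @ (phi - gamma * phi_next) + lam * n * np.eye(num_features)
    b_vector = phi.T @ rewards
    return np.linalg.solve(a_matrix, b_vector)


def empirical_msbe(phi, phi_next, rewards, gamma, theta):
    """Mean squared Bellman residual over the sampled transitions."""
    residual = rewards + gamma * (phi_next @ theta) - phi @ theta
    return float(np.mean(residual ** 2))


rng = np.random.default_rng(0)
state_dim, n_transitions, gamma, lam = 4, 50, 0.9, 1e-2

# Hypothetical synthetic transitions (assumption): contracting noisy linear
# dynamics, with the reward equal to the first state coordinate.
states = rng.normal(size=(n_transitions, state_dim))
next_states = 0.8 * states + 0.1 * rng.normal(size=(n_transitions, state_dim))
rewards = states[:, 0]

# Sweep the number of random features N across the ratio N / n = 1.
for num_features in (10, 50, 200):
    weights = rng.normal(size=(num_features, state_dim)) / np.sqrt(state_dim)
    phi = random_features(states, weights)
    phi_next = random_features(next_states, weights)
    theta = regularized_lstd(phi, phi_next, rewards, gamma, lam)
    msbe = empirical_msbe(phi, phi_next, rewards, gamma, theta)
    print(f"N/n = {num_features / n_transitions:.1f}  empirical MSBE = {msbe:.4f}")
```

Estimating the true MSBE would additionally require evaluating the Bellman residual over all states under the environment's transition model, which this sketch does not attempt.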