We prove that a single-layer neural network trained with the online actor-critic algorithm converges in distribution to a random ordinary differential equation (ODE) as the number of hidden units and the number of training steps tend to infinity. In the online actor-critic algorithm, the distribution of the data samples changes dynamically as the model is updated, which is a key challenge for any convergence analysis. We establish the geometric ergodicity of the data samples under a fixed actor policy. Then, using a Poisson equation, we prove that the fluctuations of the model updates around the limit distribution, which are caused by the randomly arriving data samples, vanish as the number of parameter updates tends to infinity. Using the Poisson equation and weak convergence techniques, we prove that the actor neural network and the critic neural network converge to the solutions of a system of ODEs with random initial conditions. Analysis of the limit ODE shows that the limit critic network converges to the true value function, which provides the actor with an asymptotically unbiased estimate of the policy gradient. We then prove that the limit actor network converges to a stationary point.
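To illustrate how a Poisson equation controls the fluctuations caused by randomly arriving samples, the following sketch treats the simplest case of a fixed policy. The notation is an assumption introduced here for illustration and is not taken from the paper: $(X_k)_{k \ge 0}$ denotes the Markov chain of data samples, $\Pi$ its transition kernel, $\pi$ its stationary distribution, and $f$ a bounded test function with $\pi(f) = \int f \, d\pi$. Geometric ergodicity is assumed in the uniform form $\sup_x |(\Pi^k f)(x) - \pi(f)| \le C \rho^k$ for some $C < \infty$ and $\rho \in (0,1)$.

$$
u(x) - (\Pi u)(x) = f(x) - \pi(f),
\qquad
u(x) = \sum_{k=0}^{\infty} \big( (\Pi^k f)(x) - \pi(f) \big),
\qquad
\sup_x |u(x)| \le \frac{C}{1-\rho}.
$$

$$
\frac{1}{K} \sum_{k=0}^{K-1} \big( f(X_k) - \pi(f) \big)
= \frac{u(X_0) - u(X_K)}{K}
+ \frac{1}{K} \sum_{k=0}^{K-1} \big( u(X_{k+1}) - (\Pi u)(X_k) \big).
$$

The first term on the right is a telescoping remainder of order $1/K$ because $u$ is bounded. The second is an average of bounded martingale differences, since $\mathbb{E}[u(X_{k+1}) \mid X_k] = (\Pi u)(X_k)$, so it also vanishes as $K \rightarrow \infty$. In the online actor-critic setting the policy, and hence the kernel, changes with every update, so the actual argument has to handle that dependence as well; this fixed-policy version is only meant to show the mechanism.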