Actor-critic (AC) algorithms, empowered by neural networks, have had significant empirical success in recent years. However, most existing theoretical support for AC algorithms focuses on linear function approximation or linearized neural networks, where the feature representation is fixed throughout training. This restriction fails to capture representation learning, a key aspect of neural AC that is pivotal in practical problems. In this work, we take a mean-field perspective on the evolution and convergence of feature-based neural AC. Specifically, we consider a version of AC in which the actor and critic are represented by overparameterized two-layer neural networks and are updated with two-timescale learning rates. The critic is updated by temporal-difference (TD) learning with a larger stepsize, while the actor is updated via proximal policy optimization (PPO) with a smaller stepsize. In the continuous-time and infinite-width limiting regime, when the timescales are properly separated, we prove that neural AC finds the globally optimal policy at a sublinear rate. Additionally, we prove that the feature representation induced by the critic network is allowed to evolve within a neighborhood of the initial one.
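The two-timescale structure described above can be illustrated with a minimal finite-width, discrete-time sketch. It is not the algorithm analyzed in this work. The toy two-state MDP, the tanh activation, the one-hot features, the softmax policy, all function names, and the stepsizes are assumptions made for illustration. In particular, the actor step below is a simplified advantage-weighted update standing in for the PPO step. Both networks average over their neurons (a 1/width output scaling), and the per-neuron gradients are rescaled by the width, as is common in mean-field parameterizations.

```python
import math
import random

GAMMA = 0.5                      # discount factor (assumed)
N_STATES, N_ACTIONS = 2, 2       # hypothetical toy MDP
DIM = N_STATES * N_ACTIONS


def features(s, a):
    """One-hot encoding of a state-action pair (assumed input representation)."""
    x = [0.0] * DIM
    x[s * N_ACTIONS + a] = 1.0
    return x


def step(s, a):
    """Toy deterministic dynamics: action a leads to state a; action 1 pays reward 1."""
    return a, (1.0 if a == 1 else 0.0)


def init_net(width, rng):
    """Two-layer network: output weights b_i and input weights w_i, i = 1..width."""
    return {
        "b": [rng.gauss(0.0, 1.0) for _ in range(width)],
        "w": [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(width)],
    }


def forward(net, x):
    """Mean-field output: (1/width) * sum_i b_i * tanh(w_i . x)."""
    total = 0.0
    for b, w in zip(net["b"], net["w"]):
        total += b * math.tanh(sum(wi * xi for wi, xi in zip(w, x)))
    return total / len(net["b"])


def sgd_step(net, x, signal, lr):
    """Move forward(net, x) in the direction of `signal`.

    The per-neuron gradient is multiplied by the width, which cancels the
    1/width factor in forward(), so each neuron moves at an O(lr) rate.
    """
    for i, w in enumerate(net["w"]):
        h = math.tanh(sum(wi * xi for wi, xi in zip(w, x)))
        b = net["b"][i]                      # use the pre-update value for w's gradient
        net["b"][i] = b + lr * signal * h
        for j in range(len(w)):
            w[j] += lr * signal * b * (1.0 - h * h) * x[j]


def policy(actor, s):
    """Softmax policy over the actor network's outputs for state s."""
    logits = [forward(actor, features(s, a)) for a in range(N_ACTIONS)]
    top = max(logits)
    weights = [math.exp(v - top) for v in logits]
    z = sum(weights)
    return [v / z for v in weights]


def train(width=32, iters=300, lr_critic=0.1, lr_actor=0.01, seed=0):
    """Two-timescale loop: lr_critic > lr_actor, so the critic moves faster."""
    rng = random.Random(seed)
    critic, actor = init_net(width, rng), init_net(width, rng)
    for _ in range(iters):
        # Fast timescale: semi-gradient TD(0) sweep over all state-action pairs.
        for s in range(N_STATES):
            for a in range(N_ACTIONS):
                x = features(s, a)
                s_next, r = step(s, a)
                pi_next = policy(actor, s_next)
                v_next = sum(p * forward(critic, features(s_next, a2))
                             for a2, p in enumerate(pi_next))
                td_error = r + GAMMA * v_next - forward(critic, x)
                sgd_step(critic, x, td_error, lr_critic)
        # Slow timescale: advantage-weighted actor step (stand-in for PPO).
        for s in range(N_STATES):
            pi = policy(actor, s)
            q = [forward(critic, features(s, a)) for a in range(N_ACTIONS)]
            v = sum(p * qa for p, qa in zip(pi, q))
            for a in range(N_ACTIONS):
                sgd_step(actor, features(s, a), q[a] - v, lr_actor)
    return critic, actor
```

Calling `critic, actor = train()` and then `policy(actor, 0)` returns the action probabilities in state 0. Shrinking `lr_actor` relative to `lr_critic` widens the timescale separation, and increasing `width` moves the sketch closer to the overparameterized regime considered here.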