We mathematically analyze and numerically study an actor-critic machine learning algorithm for solving high-dimensional Hamilton-Jacobi-Bellman (HJB) partial differential equations from stochastic control theory. The architecture of the critic (the estimator for the value function) is structured so that the boundary condition is always perfectly satisfied (rather than being included in the training loss), and the critic is trained with a biased gradient, which reduces computational cost. The actor (the estimator for the optimal control) is trained by minimizing the integral of the Hamiltonian over the domain, where the Hamiltonian is estimated using the critic. We show that the training dynamics of the actor and critic neural networks converge in a Sobolev-type space to a certain infinite-dimensional ordinary differential equation (ODE) as the number of hidden units in the actor and critic tends to infinity. Further, under a convexity-like assumption on the Hamiltonian, we prove that any fixed point of this limit ODE is a solution of the original stochastic control problem. This provides an important guarantee for the algorithm's performance, given that finite-width neural networks may converge only to local minimizers (and not optimal solutions) due to the non-convexity of their loss functions. In our numerical studies, we demonstrate that the algorithm can solve stochastic control problems accurately in up to 200 dimensions. In particular, we construct a series of increasingly complex stochastic control problems with known analytic solutions and study the algorithm's numerical performance on them. These problems range from a linear-quadratic regulator equation to highly challenging equations with non-convex Hamiltonians, allowing us to identify and analyze the strengths and limitations of this neural actor-critic method for solving HJB equations.
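To illustrate how a critic can satisfy a boundary condition by construction rather than through a penalty term in the loss, the following is a minimal sketch and not necessarily the construction used in this work. It assumes a unit-ball domain, a hypothetical boundary function `boundary_data`, a distance-like factor `1 - |x|^2` that vanishes on the boundary, and a one-hidden-layer tanh network with an illustrative `1/width` output normalization. All of these names and choices are assumptions made for the example.

```python
import math
import random


def make_network(dim, width, seed=0):
    """Build a random one-hidden-layer tanh network R^dim -> R.

    The 1/width output normalization is an illustrative assumption.
    """
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(width)]
    biases = [rng.gauss(0.0, 1.0) for _ in range(width)]
    outer = [rng.gauss(0.0, 1.0) for _ in range(width)]

    def network(x):
        total = 0.0
        for w, b, c in zip(weights, biases, outer):
            pre_activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            total += c * math.tanh(pre_activation)
        return total / width

    return network


def boundary_data(x):
    """Hypothetical boundary condition g(x); chosen only for illustration."""
    return x[0]


def vanishing_factor(x):
    """Assumed factor that is zero on the unit sphere and positive inside."""
    return 1.0 - sum(xi * xi for xi in x)


def make_critic(network):
    """Critic V(x) = g(x) + (1 - |x|^2) * N(x).

    On the boundary the second term is zero, so V = g there
    regardless of the network parameters.
    """
    def critic(x):
        return boundary_data(x) + vanishing_factor(x) * network(x)

    return critic


dim, width = 5, 64
critic = make_critic(make_network(dim, width, seed=1))

# A point exactly on the unit sphere: the critic reproduces g there.
on_boundary = [1.0] + [0.0] * (dim - 1)
print(critic(on_boundary) == boundary_data(on_boundary))
```

With this kind of construction, only the interior residual has to be driven to zero during training, since no choice of network parameters can violate the boundary values (up to floating-point rounding for boundary points that are not exactly representable).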