We develop a new policy gradient and actor-critic algorithm for solving mean-field control problems in a continuous-time reinforcement learning setting. Our approach leverages a gradient-based representation of the value function and employs parametrized randomized policies. Both the actor (policy) and the critic (value function) are learned with a class of moment neural network functions on the Wasserstein space of probability measures, and the key feature is that trajectories of distributions are sampled directly. A central challenge addressed in this study is the computational treatment of an operator specific to the mean-field framework. To illustrate the effectiveness of our methods, we provide a comprehensive set of numerical results, covering multi-dimensional settings and nonlinear quadratic mean-field control problems with controlled volatility.
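To make the notion of a function on the space of probability measures more concrete, the following is a minimal sketch of one plausible reading of a moment-based network: a measure is represented by a cloud of particles, summarized by its empirical moments, and the moment vector is passed through a small feedforward network. The names `empirical_moments` and `MomentNetwork`, the choice of coordinate-wise raw moments, and the layer sizes are all illustrative assumptions, not the architecture used in the paper.

```python
import numpy as np


def empirical_moments(particles, max_order=3):
    """Estimate E[X^k], k = 1..max_order, per coordinate from samples of a measure.

    particles: array of shape (n_particles, dim).
    Returns a flat vector of length max_order * dim.
    """
    particles = np.asarray(particles, dtype=float)
    return np.concatenate(
        [np.mean(particles ** k, axis=0) for k in range(1, max_order + 1)]
    )


class MomentNetwork:
    """Hypothetical one-hidden-layer network acting on the moment vector.

    The weights are random and untrained; this only illustrates the
    forward pass from a particle cloud to a scalar value.
    """

    def __init__(self, dim, max_order=3, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        n_in = max_order * dim
        self.max_order = max_order
        self.W1 = rng.normal(scale=1.0 / np.sqrt(n_in), size=(n_in, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(scale=1.0 / np.sqrt(hidden), size=(hidden, 1))
        self.b2 = np.zeros(1)

    def __call__(self, particles):
        # The input enters only through its moments, so the output does not
        # depend on the ordering of the particles.
        m = empirical_moments(particles, self.max_order)
        h = np.tanh(m @ self.W1 + self.b1)
        return float((h @ self.W2 + self.b2)[0])
```

Because the particles enter only through averages, the output depends on the empirical distribution rather than on the order of the samples, which is the property one wants from a critic or actor defined on measures. For example, `MomentNetwork(dim=2)(x)` with `x` of shape `(500, 2)` returns a single scalar.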