We study policy gradient methods for mean-field control in continuous time in a reinforcement learning setting. By considering randomised policies with entropy regularisation, we derive a gradient expectation representation of the value function. This representation is amenable to actor-critic type algorithms, in which the value functions and the policies are learnt alternately from observed samples of the state and a model-free estimate of the population state distribution, by either offline or online learning. In the linear-quadratic mean-field framework, we obtain an exact parametrisation of the actor and critic functions defined on the Wasserstein space. Finally, we illustrate our algorithms with numerical experiments on concrete examples.