We study model-free policy learning for discrete-time mean-field control (MFC) problems with finite state space and compact action space. In contrast to the extensive literature on value-based methods for MFC, policy-based approaches remain largely unexplored due to the intrinsic dependence of transition kernels and rewards on the evolving population state distribution, which prevents the direct use of likelihood-ratio estimators of policy gradients from classical single-agent reinforcement learning. We introduce a novel perturbation scheme on the state-distribution flow and prove that the gradient of the resulting perturbed value function converges to the true policy gradient as the perturbation magnitude vanishes. This construction yields a fully model-free estimator based solely on simulated trajectories and an auxiliary estimate of the sensitivity of the state distribution. Building on this framework, we develop MF-REINFORCE, a model-free policy gradient algorithm for MFC, and establish explicit quantitative bounds on its bias and mean-squared error. Numerical experiments on representative mean-field control tasks demonstrate the effectiveness of the proposed approach.