Existing reinforcement learning (RL) methods struggle with complex dynamical systems that demand interactions at high frequencies or irregular time intervals. Continuous-time RL (CTRL) has emerged as a promising alternative by replacing discrete-time Bellman recursion with differential value functions defined as viscosity solutions of the Hamilton--Jacobi--Bellman (HJB) equation. While CTRL has shown promise, its applications have been largely limited to the single-agent domain. This limitation stems from two key challenges: (i) conventional solution methods for HJB equations suffer from the curse of dimensionality (CoD), making them intractable in high-dimensional systems; and (ii) even with HJB-based learning approaches, accurately approximating centralized value functions in multi-agent settings remains difficult, which in turn destabilizes policy training. In this paper, we propose a CT-MARL framework that uses physics-informed neural networks (PINNs) to approximate HJB-based value functions at scale. To ensure the value is consistent with its differential structure, we align value learning with value-gradient learning by introducing a Value Gradient Iteration (VGI) module that iteratively refines value gradients along trajectories. This improves gradient fidelity, which in turn yields more accurate values and stronger policy learning. We evaluate our method on continuous-time variants of standard benchmarks, including the multi-agent particle environment (MPE) and multi-agent MuJoCo. The results demonstrate that our approach consistently outperforms existing continuous-time RL baselines and scales to complex multi-agent dynamics.