Multi-agent reinforcement learning (MARL) has made significant progress in recent years, but most algorithms still rely on a discrete-time Markov decision process (MDP) with fixed decision intervals. This formulation is often ill-suited to complex multi-agent dynamics, particularly in high-frequency or irregularly sampled settings, where it degrades performance and motivates continuous-time MARL (CT-MARL). Existing CT-MARL methods are built mainly on Hamilton-Jacobi-Bellman (HJB) equations, but they rarely account for safety constraints such as collision penalties, since such constraints introduce discontinuities that make HJB-based learning difficult. To address this challenge, we propose a continuous-time constrained MDP (CT-CMDP) formulation and a novel MARL framework that transforms discrete MDPs into CT-CMDPs via an epigraph-based reformulation. We then solve the resulting problem with a physics-informed neural network (PINN)-based actor-critic method that enables stable and efficient optimization in continuous time. We evaluate our approach on continuous-time safe multi-particle environments (MPE) and safe multi-agent MuJoCo benchmarks. The results show smoother value approximations, more stable training, and improved performance over safe MARL baselines, validating the effectiveness and robustness of our method.