Dual-Gated Epistemic Time-Dilation: Autonomous Compute Modulation in Asynchronous MARL

Multi-Agent Reinforcement Learning (MARL) algorithms have achieved strong results across complex continuous domains, but their standard deployment adheres strictly to a synchronous operational paradigm. Under this paradigm, every agent executes a deep neural network inference at every micro-frame, regardless of whether a new decision is needed. This dense throughput is a fundamental barrier to physical deployment on edge devices, where thermal and metabolic budgets are highly constrained. We propose Epistemic Time-Dilation MAPPO (ETD-MAPPO), augmented with a Dual-Gated Epistemic Trigger. Instead of relying on rigid frame-skipping (macro-actions), agents autonomously modulate their execution frequency using two signals: aleatoric uncertainty, measured by the Shannon entropy of the policy, and epistemic uncertainty, measured by state-value divergence in a Twin-Critic architecture. To formalize this, we model the environment as a Semi-Markov Decision Process (SMDP) and introduce the SMDP-Aligned Asynchronous Gradient Masking Critic to ensure proper credit assignment. Empirically, ETD-MAPPO improves on current temporal models by more than 60% relative to baseline. Across LBF, MPE, and the 115-dimensional state space of Google Research Football (GRF), ETD prevents premature policy collapse. Notably, this unconstrained approach leads to emergent Temporal Role Specialization: computational overhead falls by a statistically significant 73.6% during off-ball execution, without degrading centralized task performance.
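The dual-gated trigger described above can be illustrated with a minimal sketch. The abstract does not specify how the two gates are combined or what thresholds are used, so the OR combination, the threshold values, and all function names here (`policy_entropy`, `critic_divergence`, `should_infer`, `smdp_return`) are assumptions made for illustration, not the authors' implementation. The last helper shows the standard SMDP treatment of a variable-length decision interval, in which rewards collected over the skipped micro-frames are discounted and summed.

```python
import numpy as np


def policy_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    p = np.asarray(probs, dtype=np.float64)
    p = p / p.sum()          # normalize defensively
    nz = p[p > 0]            # skip zero entries to avoid log(0)
    return float(-(nz * np.log(nz)).sum())


def critic_divergence(v1, v2):
    """Absolute disagreement between the two critics' state-value estimates."""
    return abs(float(v1) - float(v2))


def should_infer(probs, v1, v2, entropy_threshold=0.5, divergence_threshold=0.1):
    """Hypothetical dual gate: run a fresh inference if either signal is high.

    Assumption: the gates are combined with a logical OR and the thresholds
    are fixed constants; the source does not state either detail.
    """
    aleatoric_gate = policy_entropy(probs) > entropy_threshold
    epistemic_gate = critic_divergence(v1, v2) > divergence_threshold
    return aleatoric_gate or epistemic_gate


def smdp_return(rewards, gamma):
    """Discounted reward over one decision interval of length tau = len(rewards).

    Returns (sum_k gamma^k * r_k, gamma^tau); the second value is the
    discount applied to the bootstrap value at the next decision point.
    """
    r = np.asarray(rewards, dtype=np.float64)
    discounts = gamma ** np.arange(len(r))
    return float((discounts * r).sum()), float(gamma ** len(r))
```

Under these assumptions, an agent with a sharply peaked policy and agreeing critics would skip inference and repeat its previous action, while a near-uniform policy or a large critic disagreement would trigger a new forward pass.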
