Regulatory institutions (from content-moderation platforms to financial supervisors) observe, deliberate, and intervene only after a characteristic delay. We ask whether this processing lag alone can destabilize a multi-agent system that would otherwise remain stable, without exogenous shocks, coordination among agents, or malicious actors. We study this question in two stages. First, we analyze a delayed replicator equation in which autonomous agents receive a benefit from radical behavior but face punishment based on a lagged institutional alarm signal. We derive a closed-form critical delay threshold beyond which the unique interior equilibrium loses stability through a Hopf bifurcation, and prove via center manifold reduction that the bifurcation is supercritical (producing bounded oscillations, not explosive growth) for the entire sigmoid response-function family. Second, we embed $N=240$ agents on a network and compare three decision architectures in a factorial design: non-reactive agents (fixed policy), reactive agents (threshold heuristic without memory), and Q-learning agents (tabular reinforcement learning, adaptive with cumulative value estimates). The results reveal a hierarchy opposite to the naive expectation that learning amplifies instability: non-reactive agents are immune to delay (0% runaway across all tested values), reactive agents collapse catastrophically (96% runaway for delays $\geq 8$ steps), and Q-learning agents achieve partial resilience (66% runaway at delay $= 20$). The destabilizing ingredient is reactivity to delayed signals: agents that immediately exploit low-alarm windows trigger oscillatory feedback loops. Learning buffers this effect through implicit punishment memory encoded in the Q-values.
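To make the first stage concrete, the following is a minimal numerical sketch of a delay-induced Hopf transition in a replicator-type equation. The abstract does not state the functional form, so everything here is an assumption made for illustration: the dynamics $\dot{x}(t) = x(1-x)\,[\,b - p\,S(x(t-\tau))\,]$, the logistic alarm $S(x) = 1/(1+e^{-k(x-\theta)})$, and the parameters `b`, `p`, `k`, `theta`. Under this assumed form, linearizing around the interior equilibrium $x^*$ (where $S(x^*) = b/p$) gives $\dot{u}(t) = -a\,u(t-\tau)$ with $a = p\,x^*(1-x^*)\,S'(x^*)$, whose standard stability boundary is $\tau_c = \pi/(2a)$. This is not claimed to be the closed-form threshold derived in the paper.

```python
import math

def sigmoid(x, k, theta):
    """Assumed logistic alarm response S(x)."""
    return 1.0 / (1.0 + math.exp(-k * (x - theta)))

def critical_delay(b, p, k, theta):
    """Interior equilibrium and linear stability threshold for the ASSUMED model.

    Requires 0 < b < p so that S(x*) = b/p has a solution.
    """
    r = b / p                                   # alarm level at equilibrium
    x_star = theta + math.log(r / (1.0 - r)) / k
    slope = k * r * (1.0 - r)                   # S'(x*) for the logistic form
    a = p * x_star * (1.0 - x_star) * slope     # gain of the linearized loop
    return x_star, math.pi / (2.0 * a)

def simulate(tau, b=1.0, p=2.0, k=10.0, theta=0.5,
             dt=0.01, t_end=200.0, eps=0.01):
    """Forward-Euler integration with a constant initial history x* + eps."""
    x_star, _ = critical_delay(b, p, k, theta)
    lag = max(1, round(tau / dt))               # delay expressed in steps
    xs = [x_star + eps] * (lag + 1)
    for _ in range(round(t_end / dt)):
        x, x_lag = xs[-1], xs[-1 - lag]
        punishment = p * sigmoid(x_lag, k, theta)   # based on the lagged state
        xs.append(x + dt * x * (1.0 - x) * (b - punishment))
    return xs

def late_swing(xs):
    """Peak-to-peak range over the last quarter of the trajectory."""
    tail = xs[3 * len(xs) // 4:]
    return max(tail) - min(tail)

if __name__ == "__main__":
    x_star, tau_c = critical_delay(1.0, 2.0, 10.0, 0.5)
    print(f"x* = {x_star:.3f}, tau_c = {tau_c:.4f}")
    for factor in (0.5, 1.5):
        swing = late_swing(simulate(factor * tau_c))
        print(f"tau = {factor:.1f} * tau_c -> late peak-to-peak swing = {swing:.4f}")
```

In this sketch the perturbation dies out when the delay is well below `tau_c`, while above it the trajectory settles into a sustained oscillation that stays inside $(0, 1)$, consistent with the bounded, supercritical behavior described above. Forward Euler with a fixed step is chosen only for brevity; a dedicated delay-differential-equation solver would be more appropriate near the threshold.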