Safety-Contract Graph Multi-Agent Reinforcement Learning for Autonomous Network Security Response

Autonomous network-security response systems promise to reduce Security Operations Centre (SOC) reaction latency, but reward-only multi-agent reinforcement learning (MARL) can raise security reward and still produce policies that are not deployable. We present a safety-contract graph MARL framework and instantiate it as ACD$^3$-GAT (Adaptive Constrained Counterfactual Decisioning with a Graph Attention Network encoder), an architecture that separates simulator observations from reusable operational budgets and combines constrained optimization, graph state encoding, and counterfactual action screening. We evaluate the method in CAGE Challenge 4, where agents operate under budgets for Mean Time to Recover (MTTR), false-positive response, and firewall change-management disruption. Across the benchmark, every unconstrained method violates the SOC downtime budget in 100% of evaluated episodes, with mean downtime proxy costs of 311 to 430 against a budget of 50. This complements prior CAGE Challenge 4 findings by showing that reward-only learning lacks operational discipline. Constrained MAPPO-GAT (C-MAPPO-GAT) isolates Lagrangian operational-cost control and budget-aware screening, while ACD$^3$-GAT adds budget context, Conditional Value-at-Risk (CVaR) tail-risk estimation, opponent-belief state, and Graph Counterfactual Risk Propagation (G-CRP). The replicated comparison uses three seeds of 200 episodes each for IPPO, MAPPO-GAT, C-MAPPO-GAT, and ACD$^3$-GAT. Relative to MAPPO-GAT, C-MAPPO-GAT reduces the downtime violation rate from 100% to 0.3% and the mean downtime cost from 355.4 to 15.5. ACD$^3$-GAT reduces mean downtime cost to 48.2 with a 13.8% violation rate, placing it on the safety-contract frontier rather than at the most conservative compliance point. Topology-seed and coupled adaptive Red-process stress tests preserve this contrast and show lower worst-case adaptive degradation for safety-constrained policies than for reward-only MAPPO-GAT.
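
The budget quantities reported above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes that a violation means an episode's downtime cost exceeds the budget, that CVaR is the mean of the worst (1 - alpha) fraction of episode costs, and that the Lagrange multiplier follows a standard projected dual-ascent update. The function names, the level `alpha`, and the step size `lr` are hypothetical.

```python
import math
from typing import Sequence


def violation_rate(costs: Sequence[float], budget: float) -> float:
    """Fraction of episodes whose cost exceeds the budget (assumed definition)."""
    return sum(c > budget for c in costs) / len(costs)


def cvar(costs: Sequence[float], alpha: float = 0.9) -> float:
    """Empirical CVaR: mean of the worst (1 - alpha) fraction of episode costs."""
    ordered = sorted(costs)
    # Round before ceil so floating-point noise cannot inflate the tail size.
    tail = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    return sum(ordered[-tail:]) / tail


def dual_ascent_step(lmbda: float, mean_cost: float, budget: float,
                     lr: float = 0.01) -> float:
    """One projected gradient-ascent step on the Lagrange multiplier.

    The multiplier grows while the mean cost exceeds the budget and is
    clipped at zero once the constraint is satisfied.
    """
    return max(0.0, lmbda + lr * (mean_cost - budget))


if __name__ == "__main__":
    # Hypothetical per-episode downtime costs, not data from the paper.
    episode_costs = [10.0, 20.0, 30.0, 40.0, 60.0]
    budget = 50.0  # downtime budget quoted in the abstract
    print(violation_rate(episode_costs, budget))
    print(cvar(episode_costs, alpha=0.8))
    print(dual_ascent_step(0.0, sum(episode_costs) / len(episode_costs), budget))
```

Under these assumptions, a policy whose mean cost sits far above the budget of 50 drives the multiplier upward at every update, which increases the penalty on costly actions, while a policy already under budget leaves the multiplier at zero.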
