Unveiling the Black Box: A Multi-Layer Framework for Explaining Reinforcement Learning-Based Cyber Agents

Reinforcement Learning (RL) agents are increasingly used to simulate sophisticated cyberattacks, but their decision-making processes remain opaque, hindering trust, debugging, and defensive preparedness. In high-stakes cybersecurity contexts, explainability is essential for understanding how adversarial strategies form and evolve over time. In this paper, we propose a unified, multi-layer explainability framework for RL-based attacker agents that reveals both strategic (MDP-level) and tactical (policy-level) reasoning. At the MDP level, we model cyberattacks as Partially Observable Markov Decision Processes (POMDPs) to expose exploration-exploitation dynamics and phase-aware behavioural shifts. At the policy level, we analyse the temporal evolution of Q-values and use Prioritised Experience Replay (PER) to surface critical learning transitions and evolving action preferences. Evaluated across CyberBattleSim environments of increasing complexity, our framework offers interpretable insights into agent behaviour at scale. Unlike previous explainable RL methods, which are often post-hoc, domain-specific, or limited in depth, our approach is both agent- and environment-agnostic, supporting use cases ranging from red-team simulation to RL policy debugging. By transforming black-box learning into actionable behavioural intelligence, our framework enables both defenders and developers to better anticipate, analyse, and respond to autonomous cyber threats.
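
To make the policy-level idea concrete, the following minimal sketch shows one way PER priorities could be used to surface "critical" transitions: each stored transition is scored by its one-step Q-learning TD error, and the standard proportional PER rule turns those scores into sampling probabilities. This is an illustration under stated assumptions, not the authors' implementation: the `Transition` type, the function names, the toy Q-table, and the hyperparameters (`gamma`, `alpha`, `eps`) are all hypothetical, and no CyberBattleSim API is used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    """One replay-buffer entry (hypothetical layout for illustration)."""
    state: int
    action: int
    reward: float
    next_state: int
    done: bool


def td_error(q, t, gamma=0.9):
    """One-step Q-learning TD error: r + gamma * max_a' Q(s', a') - Q(s, a)."""
    bootstrap = 0.0 if t.done else gamma * max(q[t.next_state])
    return t.reward + bootstrap - q[t.state][t.action]


def per_probabilities(errors, alpha=0.6, eps=1e-3):
    """Proportional PER: p_i = |delta_i| + eps, P(i) = p_i^alpha / sum_k p_k^alpha."""
    scaled = [(abs(e) + eps) ** alpha for e in errors]
    total = sum(scaled)
    return [s / total for s in scaled]


def rank_transitions(q, buffer, gamma=0.9, top_k=2):
    """Return the top_k transitions by PER sampling probability."""
    errors = [td_error(q, t, gamma) for t in buffer]
    probs = per_probabilities(errors)
    order = sorted(range(len(buffer)), key=lambda i: probs[i], reverse=True)
    return [(buffer[i], errors[i], probs[i]) for i in order[:top_k]]


# Toy setup (assumed values): 3 states x 2 actions, untrained Q-table.
q_table = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
replay = [
    Transition(0, 0, 0.0, 1, False),   # uneventful step, TD error 0
    Transition(1, 1, 10.0, 2, True),   # large reward, TD error 10
    Transition(0, 1, -1.0, 0, False),  # small penalty, TD error -1
]

top = rank_transitions(q_table, replay)
for transition, error, prob in top:
    print(f"s={transition.state} a={transition.action} "
          f"td_error={error:+.2f} sample_prob={prob:.3f}")
```

In this toy buffer the high-reward transition ranks first because its TD error is largest in magnitude. Logging such rankings at intervals during training, alongside snapshots of the Q-table, is one plausible way to trace how action preferences shift over time.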
