Unveiling the Black Box: A Multi-Layer Framework for Explaining Reinforcement Learning-Based Cyber Agents

Reinforcement Learning (RL) agents are increasingly used to simulate sophisticated cyberattacks, but their decision-making processes remain opaque, hindering trust, debugging, and defensive preparedness. In high-stakes cybersecurity contexts, explainability is essential for understanding how adversarial strategies form and evolve over time. In this paper, we propose a unified, multi-layer explainability framework for RL-based attacker agents that reveals both strategic (Markov Decision Process, MDP-level) and tactical (policy-level) reasoning. At the MDP level, we model cyberattacks as a Partially Observable Markov Decision Process (POMDP) to expose exploration-exploitation dynamics and phase-aware behavioural shifts. At the policy level, we analyse the temporal evolution of Q-values and use Prioritised Experience Replay (PER) to surface critical learning transitions and evolving action preferences. Evaluated across CyberBattleSim environments of increasing complexity, our framework offers interpretable insights into agent behaviour at scale. Unlike previous explainable RL methods, which are predominantly post-hoc, domain-specific, or limited in depth, our approach is both agent- and environment-agnostic, supporting use cases such as red-team simulation, RL policy debugging, phase-aware threat modelling, and anticipatory defence planning. By transforming black-box learning into actionable behavioural intelligence, our framework enables both defenders and developers to better anticipate, analyse, and respond to autonomous cyber threats.
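
To make the policy-level idea concrete, the following minimal sketch shows one way PER-style priorities could be used to surface the transitions an agent learns most from. It is an illustration under stated assumptions, not the implementation evaluated in this paper: it assumes a tabular Q-function, the standard proportional prioritisation p_i = (|delta_i| + eps)^alpha, and hypothetical state and action names ("foothold", "pivot", "scan", "exploit") that do not come from CyberBattleSim.

```python
from typing import Dict, List, Tuple

# (state, action, reward, next_state, done); all names are hypothetical.
Transition = Tuple[str, str, float, str, bool]
QTable = Dict[str, Dict[str, float]]


def td_error(q: QTable, t: Transition, gamma: float = 0.9) -> float:
    """One-step Q-learning TD error: r + gamma * max_a' Q(s', a') - Q(s, a)."""
    state, action, reward, next_state, done = t
    next_values = q.get(next_state, {})
    # No bootstrapping from terminal or unseen next states.
    bootstrap = 0.0 if done or not next_values else max(next_values.values())
    return reward + gamma * bootstrap - q.get(state, {}).get(action, 0.0)


def per_probabilities(
    errors: List[float], alpha: float = 0.6, eps: float = 1e-3
) -> List[float]:
    """Proportional PER: P(i) = p_i / sum_k p_k with p_i = (|delta_i| + eps)^alpha."""
    priorities = [(abs(e) + eps) ** alpha for e in errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def top_transitions(
    q: QTable, buffer: List[Transition], k: int = 1
) -> List[Tuple[Transition, float]]:
    """Rank buffered transitions by sampling probability, highest first."""
    probs = per_probabilities([td_error(q, t) for t in buffer])
    ranked = sorted(zip(buffer, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Toy Q-table and replay buffer (assumed values for illustration only).
q_table: QTable = {
    "foothold": {"scan": 0.5, "exploit": 0.1},
    "pivot": {"scan": 0.2},
}
replay_buffer: List[Transition] = [
    ("foothold", "scan", 0.0, "foothold", False),
    ("foothold", "exploit", 10.0, "pivot", False),
]

for transition, prob in top_transitions(q_table, replay_buffer, k=2):
    print(transition, round(prob, 3))
```

In this toy buffer the rewarded "exploit" transition has by far the larger TD error and therefore the higher replay priority, which is the kind of signal the framework reads as a critical learning transition. Logging such rankings over training would give the temporal view of Q-value evolution described above.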
