Recent years have seen an explosion of interest in autonomous cyber defence agents trained to defend computer networks using deep reinforcement learning. These agents are typically trained in cyber gym environments using dense, highly engineered reward functions that combine many penalties and incentives for a range of (un)desirable states and costly actions. Dense rewards help alleviate the challenge of exploring complex environments but risk biasing agents towards suboptimal and potentially riskier solutions, a critical issue in complex cyber environments. We thoroughly evaluate the impact of reward function structure on learning and on policy behavioural characteristics using a variety of sparse and dense reward functions, two well-established cyber gyms, a range of network sizes, and both policy gradient and value-based RL algorithms. Our evaluation is enabled by a novel ground-truth evaluation approach that permits direct comparison between different reward functions, illuminating the nuanced interrelationships between rewards, action spaces, and the risk of suboptimal policies in cyber environments. Our results show that sparse rewards, provided they are goal-aligned and can be encountered frequently, uniquely offer both enhanced training reliability and more effective cyber defence agents with lower-risk policies. Surprisingly, sparse rewards can also yield policies that are better aligned with cyber defender goals and that make sparing use of costly defensive actions without explicit numerical penalties in the reward.
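To make the contrast between the two reward styles concrete, the minimal sketch below illustrates a dense, engineered reward versus a sparse, goal-aligned reward for a single defender step. This is not code from the evaluated gyms: the state fields (compromised_hosts, critical_service_up, action_cost), the numerical weights, and the choice of goal event are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StepInfo:
    """Hypothetical per-step summary of a cyber gym episode."""
    compromised_hosts: int     # hosts currently controlled by the attacker
    critical_service_up: bool  # whether the defended service is still running
    action_cost: float         # cost of the defensive action just taken


def dense_reward(step: StepInfo) -> float:
    """Dense, engineered reward: several hand-tuned penalty/incentive terms.

    The weights are illustrative; real cyber gyms combine many such terms.
    """
    reward = 0.0
    reward -= 1.0 * step.compromised_hosts               # per-host compromise penalty
    reward += 0.5 if step.critical_service_up else -5.0  # service availability term
    reward -= step.action_cost                           # penalty for costly actions
    return reward


def sparse_reward(step: StepInfo) -> float:
    """Sparse, goal-aligned reward: a signal only when the goal event occurs.

    The goal event here ('the critical service survived this step') can occur
    on every step, so the sparse signal is still encountered frequently.
    """
    return 1.0 if step.critical_service_up else 0.0


# Example step: two compromised hosts, service up, a moderately costly action.
step = StepInfo(compromised_hosts=2, critical_service_up=True, action_cost=0.3)
print(dense_reward(step))   # -1.8: sum of engineered penalty/incentive terms
print(sparse_reward(step))  # 1.0: the goal event occurred this step
```

Note that the sparse variant attaches no explicit penalty to action cost, which matches the finding above that sparse rewards can nonetheless induce sparing use of costly defensive actions.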