This paper demonstrates the potential for autonomous cyber defence to be applied to industrial control systems and provides a baseline environment for further exploring the application of Multi-Agent Reinforcement Learning (MARL) to this problem domain. It introduces IPMSRL, a simulation environment of a generic Integrated Platform Management System (IPMS), and explores the use of MARL for autonomous cyber defence decision-making on generic maritime-based IPMS Operational Technology (OT). Cyber defensive actions for OT are less mature than those for Enterprise IT. This is due to the relatively brittle nature of OT infrastructure, which originates from the use of legacy systems, design-time engineering assumptions, and the lack of full-scale modern security controls. There are many obstacles to be tackled across the cyber landscape due to continually increasing cyber-attack sophistication and the limitations of traditional IT-centric cyber defence solutions. Traditional IT controls are rarely deployed on OT infrastructure, and where they are, some threats are not fully addressed. In our experiments, a shared-critic implementation of Multi-Agent Proximal Policy Optimisation (MAPPO) outperformed Independent Proximal Policy Optimisation (IPPO). MAPPO reached an optimal policy (episode outcome mean of 1) after 800K timesteps, whereas IPPO reached an episode outcome mean of only 0.966 after one million timesteps. Hyperparameter tuning greatly improved training performance: across one million timesteps, the tuned hyperparameters reached an optimal policy, whereas the default hyperparameters won only sporadically, with most simulations resulting in a draw. We also tested a real-world constraint, attack detection alert success, and found that when the alert success probability is reduced to 0.75 or 0.9, the MARL defenders still win in over 97.5% or 99.5% of episodes, respectively.
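The alert-success constraint can be pictured as a filter between what the attacker actually does and what the defenders observe. The following is a minimal sketch, assuming each true attack event independently raises an alert with a fixed probability; the function names and node labels are hypothetical and are not taken from IPMSRL.

```python
import random


def observe_alerts(compromised_nodes, alert_success_prob, rng):
    """Return the compromised nodes for which an alert actually fires.

    Assumption: each compromised node raises an alert independently
    with probability `alert_success_prob`.
    """
    return [n for n in compromised_nodes if rng.random() < alert_success_prob]


def estimate_alert_rate(alert_success_prob, trials=10_000, seed=0):
    """Empirically estimate the fraction of attack events that reach the defenders."""
    rng = random.Random(seed)
    nodes = ["hmi", "plc", "historian"]  # hypothetical OT node labels
    fired = sum(
        len(observe_alerts(nodes, alert_success_prob, rng)) for _ in range(trials)
    )
    return fired / (trials * len(nodes))


if __name__ == "__main__":
    for p in (1.0, 0.9, 0.75):
        print(f"alert_success_prob={p}: observed rate ~ {estimate_alert_rate(p):.3f}")
```

Under this assumption, a defender policy sees only the filtered alerts, so lowering the probability directly reduces the information available for each decision.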