Penetration testing, the practice of simulating cyberattacks to identify vulnerabilities, is a complex sequential decision-making task that is inherently partially observable and features large action spaces. Training reinforcement learning (RL) policies for this domain faces a fundamental bottleneck: existing simulators are too slow to train on realistic network scenarios at scale, resulting in policies that fail to generalize. We present NASimJax, a complete JAX-based reimplementation of the Network Attack Simulator (NASim), achieving up to 100x higher environment throughput than the original simulator. By running the entire training pipeline on hardware accelerators, NASimJax makes experiments on larger networks, previously infeasible under a fixed compute budget, practical. We formulate automated penetration testing as a Contextual POMDP and introduce a network generation pipeline that produces structurally diverse and guaranteed-solvable scenarios. Together, these provide a principled basis for studying zero-shot policy generalization. We use the framework to investigate action-space scaling and generalization across networks of up to 40 hosts. We find that Prioritized Level Replay handles dense training distributions better than Domain Randomization, particularly at larger scales, and that training on sparser topologies yields an implicit curriculum that improves out-of-distribution generalization, even on topologies denser than those seen during training. To handle linearly growing action spaces, we propose a two-stage action decomposition (2SAS) that substantially outperforms flat action masking at scale. Finally, we identify a failure mode arising from the interaction between Prioritized Level Replay's episode-reset behaviour and 2SAS's credit assignment structure. NASimJax thus provides a fast, flexible, and realistic platform for advancing RL-based penetration testing.
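To make the contrast between flat action masking and a two-stage decomposition concrete, the following is a minimal JAX sketch. The abstract does not specify how 2SAS factors the action space, so the sketch assumes one plausible factorization: the policy first samples a target host, then samples an action type for that host. All names here (`masked_sample`, `two_stage_sample`, `flat_sample`, the `valid` mask layout) are hypothetical and are not taken from NASimJax.

```python
import jax
import jax.numpy as jnp

NEG_INF = -1e9  # large negative logit used to rule out invalid choices


def masked_sample(key, logits, mask):
    """Sample an index from `logits`, ignoring entries where `mask` is False."""
    masked_logits = jnp.where(mask, logits, NEG_INF)
    return jax.random.categorical(key, masked_logits)


def two_stage_sample(key, host_logits, action_logits, valid):
    """Assumed two-stage scheme: pick a host, then an action type for it.

    host_logits:   (num_hosts,)              score for each target host
    action_logits: (num_hosts, num_actions)  score for each action type per host
    valid:         (num_hosts, num_actions)  True where the pair is allowed
    """
    key_host, key_action = jax.random.split(key)
    # A host is selectable only if at least one action on it is valid.
    host_mask = valid.any(axis=1)
    host = masked_sample(key_host, host_logits, host_mask)
    # Second stage: only the chosen host's row of logits and mask is used.
    action = masked_sample(key_action, action_logits[host], valid[host])
    return host, action


def flat_sample(key, flat_logits, valid):
    """Flat baseline: one masked categorical over all (host, action) pairs."""
    num_actions = valid.shape[1]
    idx = masked_sample(key, flat_logits, valid.reshape(-1))
    return idx // num_actions, idx % num_actions
```

In this sketch the flat policy needs one output per (host, action type) pair, whereas the two-stage variant samples from one categorical over hosts and one over action types. Both functions are pure, so they can be batched over parallel environments with `jax.vmap`, for example `jax.vmap(two_stage_sample, in_axes=(0, None, None, None))(keys, host_logits, action_logits, valid)`.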