With the growing number of vulnerabilities exposed on the internet, autonomous penetration testing (pentesting) has become an emerging research area, and reinforcement learning (RL) is a natural fit for studying it. Previous research on RL-based autonomous pentesting has mainly focused on improving agents' learning efficacy within abstract simulated training environments, overlooking the applicability and generalization requirements of deploying agent policies in real-world environments that differ substantially from their training settings. In contrast, we shift the focus, for the first time, to pentesting agents' ability to generalize across unseen real environments. To this end, we propose a Generalizable Autonomous Pentesting framework (GAP) for training agents capable of drawing inferences about new environments from prior ones -- a key requirement for the broad deployment of autonomous pentesting and a hallmark of human intelligence. GAP introduces a Real-to-Sim-to-Real pipeline built on two key methods: domain randomization and meta-reinforcement learning (meta-RL). Specifically, we are among the first to apply domain randomization to autonomous pentesting, proposing a large language model (LLM)-powered domain randomization method for synthetic environment generation. We then apply meta-RL over these synthetic environments to improve the agents' generalization ability in unseen settings. Together, the two methods effectively bridge the generalization gap and improve policy adaptation performance. Experiments on various vulnerable virtual machines show that GAP can (a) enable policy learning in unknown real environments, (b) achieve zero-shot policy transfer across similar environments, and (c) realize rapid policy adaptation in dissimilar environments.
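To make the LLM-powered domain randomization step concrete, the following is a minimal sketch, not GAP's released implementation: `llm_complete` is a hypothetical placeholder for whatever chat-completion API one wires in, and the attribute pools and field names are illustrative assumptions. The idea is the Real-to-Sim step, which perturbs facts recovered from a real host into many plausible synthetic environment configurations.

```python
import json
import random

# Hypothetical stand-in for an LLM call; any chat-completion API could be
# wired in here. It is NOT an API from the paper or a real library.
def llm_complete(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM provider")

# Illustrative attribute pools for the non-LLM fallback path (assumed values).
OS_POOL = ["Ubuntu 20.04", "Windows Server 2019", "CentOS 7"]
SERVICE_POOL = ["ssh", "http", "smb", "ftp", "mysql"]

def randomize_environment(seed_scan: dict, rng: random.Random) -> dict:
    """Return one synthetic environment config derived from a real scan.

    seed_scan holds facts recovered from a real host (the Real-to-Sim step);
    the output is a config dict a network simulator could instantiate.
    """
    prompt = (
        "Given this host description, produce a JSON variant with different "
        f"but realistic services and vulnerabilities:\n{json.dumps(seed_scan)}"
    )
    try:
        return json.loads(llm_complete(prompt))
    except NotImplementedError:
        # Fallback: plain attribute-level randomization without the LLM.
        return {
            "os": rng.choice(OS_POOL),
            "services": rng.sample(SERVICE_POOL, k=rng.randint(1, 3)),
            "host_count": rng.randint(2, 8),
        }

rng = random.Random(0)
seed = {"os": "Ubuntu 20.04", "services": ["ssh", "http"]}
synthetic_envs = [randomize_environment(seed, rng) for _ in range(100)]
```

A batch of such configs stands in for the distribution of real environments the agent may later face; the wider the randomization, the more likely the deployment environment falls inside the training distribution.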
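The meta-RL stage can likewise be sketched at a high level. Below is a minimal first-order MAML-style training loop in PyTorch, offered as an illustration of the technique rather than the paper's exact algorithm: `rollout_loss` is a hypothetical helper that runs the policy in one synthetic environment and returns a differentiable RL loss, and the hyperparameters are placeholder values.

```python
import copy
import torch
import torch.nn as nn

def meta_train(policy: nn.Module, envs, rollout_loss,
               meta_lr=1e-3, inner_lr=1e-2, inner_steps=1):
    """First-order MAML-style sketch: adapt to each synthetic env, then
    update the meta-policy toward parameters that adapt quickly."""
    meta_opt = torch.optim.Adam(policy.parameters(), lr=meta_lr)
    for env in envs:  # envs: synthetic tasks from domain randomization
        # Inner loop: adapt a clone of the meta-policy to this environment.
        adapted = copy.deepcopy(policy)
        inner_opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
        for _ in range(inner_steps):
            inner_opt.zero_grad()
            loss = rollout_loss(adapted, env)  # hypothetical helper
            loss.backward()
            inner_opt.step()
        # Outer loop: score the adapted policy on fresh rollouts and apply
        # its (first-order) gradient at the meta-parameters.
        adapted.zero_grad()
        meta_loss = rollout_loss(adapted, env)
        meta_loss.backward()
        for p, q in zip(policy.parameters(), adapted.parameters()):
            p.grad = q.grad.clone()  # copy post-adaptation gradients over
        meta_opt.step()
```

Under this scheme, the meta-policy is never optimized for any single environment; it is optimized so that a few inner-loop steps suffice in a new one, which is what enables the rapid adaptation in dissimilar environments reported in the abstract.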