This work introduces xOffense, an AI-driven, multi-agent penetration testing framework that replaces labor-intensive, expert-driven manual testing with fully automated, machine-executable workflows that scale with the available computational infrastructure. At its core, xOffense uses a fine-tuned, mid-scale open-source LLM (Qwen3-32B) to drive reasoning and decision-making in penetration testing. The framework assigns specialized agents to reconnaissance, vulnerability scanning, and exploitation, and an orchestration layer coordinates them across phases. Fine-tuning on Chain-of-Thought penetration testing data further enables the model to generate precise tool commands and to reason consistently across multiple steps. We evaluate xOffense on two benchmarks, AutoPenBench and AI-Pentest-Benchmark. xOffense consistently outperforms contemporary methods, achieving a sub-task completion rate of 79.17% and surpassing leading systems such as VulnBot and PentestGPT. These findings highlight the potential of domain-adapted mid-scale LLMs, embedded within structured multi-agent orchestration, to deliver cost-efficient and reproducible autonomous penetration testing.
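The phase structure described in the abstract (specialized agents for reconnaissance, vulnerability scanning, and exploitation, coordinated by an orchestration layer) can be illustrated with a minimal sketch. This is not xOffense's implementation: the names `Finding`, `SharedContext`, `orchestrate`, and the three agent functions are hypothetical, and each agent is a stub that returns placeholder text where a real system would query the LLM and run tools. It only shows one plausible way an orchestrator could pass accumulated state from phase to phase.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple


@dataclass
class Finding:
    """One piece of information produced by a phase (hypothetical type)."""
    phase: str
    detail: str


@dataclass
class SharedContext:
    """State the orchestrator hands from one phase to the next (assumed design)."""
    target: str
    findings: List[Finding] = field(default_factory=list)


# Assumed agent signature: read the shared context, return new findings.
Agent = Callable[[SharedContext], List[Finding]]


def recon_agent(ctx: SharedContext) -> List[Finding]:
    # Stub: stands in for an LLM-driven reconnaissance step.
    return [Finding("reconnaissance", f"placeholder service list for {ctx.target}")]


def scan_agent(ctx: SharedContext) -> List[Finding]:
    # Stub: consumes reconnaissance output and produces scan candidates.
    inputs = [f for f in ctx.findings if f.phase == "reconnaissance"]
    return [Finding("vulnerability_scanning", f"placeholder candidate from: {f.detail}")
            for f in inputs]


def exploit_agent(ctx: SharedContext) -> List[Finding]:
    # Stub: consumes scan candidates and records placeholder outcomes.
    inputs = [f for f in ctx.findings if f.phase == "vulnerability_scanning"]
    return [Finding("exploitation", f"placeholder outcome for: {f.detail}")
            for f in inputs]


PHASES: Sequence[Tuple[str, Agent]] = (
    ("reconnaissance", recon_agent),
    ("vulnerability_scanning", scan_agent),
    ("exploitation", exploit_agent),
)


def orchestrate(target: str, phases: Sequence[Tuple[str, Agent]] = PHASES) -> SharedContext:
    """Run the phases in order, stopping early if a phase yields nothing to hand over."""
    ctx = SharedContext(target)
    for _name, agent in phases:
        new_findings = agent(ctx)
        if not new_findings:
            break
        ctx.findings.extend(new_findings)
    return ctx


if __name__ == "__main__":
    result = orchestrate("lab-target.example")
    for finding in result.findings:
        print(f"[{finding.phase}] {finding.detail}")
```

Keeping all inter-phase state in one context object is a design choice made for this sketch; it lets each agent stay a pure function of what earlier phases found, which is one way to make runs easier to replay.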