We present the first comprehensive evaluation of AI agents against human cybersecurity professionals in a live enterprise environment. We evaluate ten cybersecurity professionals alongside six existing AI agents and ARTEMIS, our new agent scaffold, on a large university network consisting of ~8,000 hosts across 12 subnets. ARTEMIS is a multi-agent framework featuring dynamic prompt generation, arbitrary sub-agents, and automatic vulnerability triaging. In our comparative study, ARTEMIS placed second overall, discovering 9 valid vulnerabilities with an 82% valid submission rate and outperforming 9 of 10 human participants. While existing scaffolds such as Codex and CyAgent underperformed relative to most human participants, ARTEMIS demonstrated technical sophistication and submission quality comparable to the strongest participants. We observe that AI agents offer advantages in systematic enumeration, parallel exploitation, and cost: certain ARTEMIS variants run at $18/hour, versus $60/hour for professional penetration testers. We also identify key capability gaps: AI agents exhibit higher false-positive rates and struggle with GUI-based tasks.