Large language model (LLM) agents have shown impressive capabilities in natural language comprehension and reasoning, yet their potential in cybersecurity remains underexplored. We introduce DefenderBench, a practical, open-source toolkit for evaluating language agents across offensive, defensive, and knowledge-based cybersecurity tasks. DefenderBench includes environments for network intrusion, malicious content detection, code vulnerability analysis, and cybersecurity knowledge assessment. It is intentionally designed to be affordable and easily accessible to researchers while providing fair and rigorous assessment. We benchmark several state-of-the-art (SoTA) and popular LLMs, including both open- and closed-weight models, using a standardized agentic framework. Our results show that Claude-3.7-sonnet performs best with a DefenderBench score of 81.65, followed by Claude-3.7-sonnet-think with 78.40, while the best open-weight model, Llama 3.3 70B, trails closely with a score of 71.81. DefenderBench's modular design allows seamless integration of custom LLMs and tasks, promoting reproducibility and fair comparison. DefenderBench is available at https://github.com/microsoft/DefenderBench.