Large Language Models (LLMs) are being deployed across a wide range of domains. However, their capacity to solve Capture the Flag (CTF) challenges in cybersecurity has not been thoroughly evaluated. To address this, we develop a novel method for assessing LLMs on CTF challenges by creating a scalable, open-source benchmark database designed specifically for this purpose. The database compiles a diverse range of CTF challenges from popular competitions and includes metadata for LLM testing and adaptive learning. Using the function calling capabilities of LLMs, we build a fully automated system with an enhanced workflow and support for external tool calls. With our benchmark dataset and automated framework, we evaluate the performance of five LLMs, encompassing both black-box and open-source models. This work lays a foundation for future research into improving the efficiency of LLMs in interactive cybersecurity tasks and automated task planning. By providing a specialized dataset, our project offers a platform for developing, testing, and refining LLM-based approaches to vulnerability detection and resolution. Evaluating LLMs on these challenges and comparing their results with human performance yields insights into the potential of AI-driven cybersecurity solutions for real-world threat management. We make our dataset publicly available as open source at https://github.com/NYU-LLM-CTF/LLM_CTF_Database, along with our automated playground framework at https://github.com/NYU-LLM-CTF/llm_ctf_automation.