NYU CTF Dataset: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security

Minghao Shao,Sofija Jancheska,Meet Udeshi,Brendan Dolan-Gavitt,Haoran Xi,Kimberly Milner,Boyuan Chen,Max Yin,Siddharth Garg,Prashanth Krishnamurthy,Farshad Khorrami,Ramesh Karri,Muhammad Shafique

Large Language Models (LLMs) are being deployed across various domains today. However, their capacity to solve Capture the Flag (CTF) challenges in cybersecurity has not been thoroughly evaluated. To address this, we develop a novel method to assess LLMs in solving CTF challenges by creating a scalable, open-source benchmark database specifically designed for these applications. This database includes metadata for LLM testing and adaptive learning, compiling a diverse range of CTF challenges from popular competitions. Utilizing the advanced function calling capabilities of LLMs, we build a fully automated system with an enhanced workflow and support for external tool calls. Our benchmark dataset and automated framework allow us to evaluate the performance of five LLMs, encompassing both black-box and open-source models. This work lays the foundation for future research into improving the efficiency of LLMs in interactive cybersecurity tasks and automated task planning. By providing a specialized dataset, our project offers an ideal platform for developing, testing, and refining LLM-based approaches to vulnerability detection and resolution. Evaluating LLMs on these challenges and comparing with human performance yields insights into their potential for AI-driven cybersecurity solutions to perform real-world threat management. We make our dataset open source to public https://github.com/NYU-LLM-CTF/LLM_CTF_Database along with our playground automated framework https://github.com/NYU-LLM-CTF/llm_ctf_automation.

翻译：当前，大型语言模型（LLM）正被部署于各个领域。然而，其在网络安全领域解决夺旗赛（CTF）挑战的能力尚未得到充分评估。为此，我们开发了一种新颖的方法，通过创建一个专门为此类应用设计的可扩展开源基准数据库来评估LLM解决CTF挑战的能力。该数据库包含用于LLM测试和自适应学习的元数据，汇集了来自热门竞赛的各类CTF挑战。利用LLM先进的功能调用能力，我们构建了一个具有增强工作流并支持外部工具调用的全自动化系统。我们的基准数据集和自动化框架使我们能够评估五种LLM（包括黑盒模型和开源模型）的性能。这项工作为未来研究如何提高LLM在交互式网络安全任务和自动化任务规划中的效率奠定了基础。通过提供专门的数据集，本项目为开发、测试和完善基于LLM的漏洞检测与解决方法提供了一个理想的平台。在这些挑战上评估LLM并与人类表现进行比较，有助于深入理解其在AI驱动的网络安全解决方案中执行现实世界威胁管理的潜力。我们将数据集开源公开（https://github.com/NYU-LLM-CTF/LLM_CTF_Database），并同时开放我们的自动化框架平台（https://github.com/NYU-LLM-CTF/llm_ctf_automation）。

相关内容

Automator

关注 5

Automator是苹果公司为他们的Mac OS X系统开发的一款软件。 只要通过点击拖拽鼠标等操作就可以将一系列动作组合成一个工作流，从而帮助你自动的（可重复的）完成一些复杂的工作。Automator还能横跨很多不同种类的程序，包括：查找器、Safari网络浏览器、iCal、地址簿或者其他的一些程序。它还能和一些第三方的程序一起工作，如微软的Office、Adobe公司的Photoshop或者Pixelmator等。

【CVPR 2022】基于元内存传输的跨域少镜头语义分割，Remember the Difference: Cross-Domain Few-Shot Semantic Segmentation via Meta-Memory Transfer

专知会员服务

13+阅读 · 2022年3月12日

O’Reilly报告：知识图谱崛起——面向现代数据集成和数据结构体系，“The Rise of the Knowledge Graph——Toward Modern Data Integration and the Data Fabric Architecture”

专知会员服务

49+阅读 · 2022年2月18日

【NeurIPS2021】用于文本图表示学习的 GNN 嵌套 Transformer 模型：GraphFormers

专知会员服务

46+阅读 · 2021年11月24日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日