Capture the Flag (CTF) challenges are puzzles built around computer security scenarios. With the advent of large language models (LLMs), a growing number of CTF participants use LLMs to understand and solve these challenges. However, no prior work has evaluated how effectively LLMs solve CTF challenges in a fully automated workflow. We develop two CTF-solving workflows, human-in-the-loop (HITL) and fully automated, to examine the ability of LLMs to solve a selected set of CTF challenges when prompted with information about each question. We also collect human contestants' results on the same set of questions and find that LLMs achieve a higher success rate than the average human participant. This work provides a comprehensive evaluation of the capability of LLMs to solve real-world CTF challenges, from real competition to fully automated workflow. Our results provide a reference for applying LLMs in cybersecurity education and pave the way for systematic evaluation of the offensive cybersecurity capabilities of LLMs.