Recent advances in LLM agentic systems have improved the automation of offensive security tasks, particularly Capture the Flag (CTF) challenges. We systematically investigate the key factors that drive agent success and provide a detailed recipe for building effective LLM-based offensive security agents. First, we present CTFJudge, a framework that uses an LLM as a judge to analyze agent trajectories and provide granular evaluation across CTF solving steps. Second, we propose a novel metric for partial correctness, the CTF Competency Index (CCI), which reveals how closely agent solutions align with human-crafted gold standards. Third, we examine how LLM hyperparameters, namely temperature, top-p, and maximum token length, influence agent performance and automated cybersecurity task planning. For rapid evaluation, we present CTFTiny, a curated benchmark of 50 representative CTF challenges across binary exploitation, web, reverse engineering, forensics, and cryptography. Our findings identify optimal multi-agent coordination settings and lay the groundwork for future LLM agent research in cybersecurity. We release CTFTiny as open source at https://github.com/NYU-LLM-CTF/CTFTiny and CTFJudge at https://github.com/NYU-LLM-CTF/CTFJudge.