Large language models (LLMs) have demonstrated remarkable capabilities across a variety of domains. However, their application to cryptography, a foundational pillar of cybersecurity, remains largely unexplored. To address this gap, we build \textbf{AICrypto}, a comprehensive benchmark designed to evaluate the cryptographic capabilities of LLMs. The benchmark comprises 135 multiple-choice questions, 150 capture-the-flag challenges, and 30 proof problems, covering a broad range of skills from knowledge memorization to vulnerability exploitation and formal reasoning. All tasks were carefully reviewed or constructed by cryptography experts to ensure correctness and rigor. For each proof problem, we provide a detailed scoring rubric and a reference solution that enable automated grading, achieving high correlation with human expert evaluations. We also establish strong human expert baselines for comparison across all task types. Our evaluation of 17 leading LLMs reveals that state-of-the-art models match or even surpass human experts in memorizing cryptographic concepts, exploiting common vulnerabilities, and completing routine proofs. However, our analysis shows that they still lack a deep understanding of abstract mathematical concepts and struggle with tasks that require multi-step reasoning and dynamic analysis. We hope this work provides insights for future research on LLMs in cryptographic applications. Our code and dataset are available at https://aicryptobench.github.io/.