We present \textbf{AICrypto}, a comprehensive benchmark for evaluating the cryptography capabilities of large language models (LLMs). The benchmark comprises 135 multiple-choice questions, 150 capture-the-flag challenges, and 30 proof problems, covering skills that range from knowledge memorization to vulnerability exploitation and formal reasoning. All tasks are reviewed or constructed by cryptography experts to improve correctness and rigor. For each proof problem, we provide a detailed scoring rubric and a reference solution, which enable automated grading that correlates highly with human expert evaluations. We also establish strong human expert baselines for comparison across all task types. Our evaluation of 17 leading LLMs shows that state-of-the-art models match or even surpass human experts in memorizing cryptographic concepts, exploiting common vulnerabilities, and completing routine proofs. However, they still lack a deep understanding of abstract mathematical concepts and struggle with tasks that require multi-step reasoning and dynamic analysis. We hope this work provides insights for future research on LLMs in cryptographic applications. Our code and dataset are available at https://github.com/wangyu-ovo/aicrypto-agent.