Large language models (LLMs) are increasingly used to assist developers with code, yet their implementations of cryptographic functionality often contain exploitable flaws. Minor design choices (e.g., static initialization vectors or missing authentication) can silently invalidate security guarantees. We introduce CIPHER (\textbf{C}ryptographic \textbf{I}nsecurity \textbf{P}rofiling via \textbf{H}ybrid \textbf{E}valuation of \textbf{R}esponses), a benchmark for measuring cryptographic vulnerability incidence in LLM-generated Python code under controlled security-guidance conditions. CIPHER uses insecure/neutral/secure prompt variants per task, a cryptography-specific vulnerability taxonomy, and line-level attribution via an automated scoring pipeline. Across a diverse set of widely used LLMs, we find that explicit ``secure'' prompting reduces some targeted issues but does not reliably eliminate cryptographic vulnerabilities overall. The benchmark and reproducible scoring pipeline will be publicly released upon publication.
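To make the "silently invalidate security guarantees" point concrete, the following is a minimal sketch (not taken from the benchmark's tasks) of why a static initialization vector, here a fixed nonce in a stream-cipher construction, is exploitable. The toy hash-based keystream is an illustrative stand-in for AES-CTR so the example stays dependency-free; the attack itself, recovering one plaintext from the other via keystream reuse, is the standard one.

```python
import hashlib

def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    # Toy CTR-style keystream for illustration only: SHA-256(key || nonce || counter).
    # Any stream cipher (e.g., AES-CTR) has the same reuse property.
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]

def encrypt(key: bytes, nonce: bytes, plaintext: bytes) -> bytes:
    ks = keystream(key, nonce, len(plaintext))
    return bytes(a ^ b for a, b in zip(plaintext, ks))

key = b"k" * 16
STATIC_NONCE = b"\x00" * 8  # the flaw: a fixed IV/nonce reused across messages

p1 = b"attack at dawn!!"
p2 = b"retreat at noon!"
c1 = encrypt(key, STATIC_NONCE, p1)
c2 = encrypt(key, STATIC_NONCE, p2)

# Keystream reuse: XOR of the two ciphertexts equals XOR of the plaintexts,
# so an attacker who knows (or guesses) p1 recovers p2 without the key.
xor_c = bytes(a ^ b for a, b in zip(c1, c2))
recovered_p2 = bytes(a ^ b for a, b in zip(xor_c, p1))
assert recovered_p2 == p2
```

Note that both calls to `encrypt` succeed and produce plausible-looking ciphertext, which is exactly why such flaws in generated code go unnoticed without targeted evaluation.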