The use of Large Language Models (LLMs) in software development is growing rapidly, with developers increasingly relying on these models for coding assistance, including security-critical tasks. We present a comprehensive comparison between traditional static analysis tools for cryptographic API misuse detection (CryptoGuard, CogniCrypt, and Snyk Code) and LLMs (GPT and Gemini). Using three benchmark datasets (OWASP, CryptoAPI, and MASC), we evaluate how effectively each tool identifies cryptographic misuses. Our findings show that GPT-4o-mini surpasses current state-of-the-art static analysis tools on the CryptoAPI and MASC datasets, though it lags behind them on the OWASP dataset. We also assess the quality of LLM responses to determine which models provide actionable and accurate advice, giving developers insight into their practical utility for secure coding. This study highlights the comparative strengths and limitations of static analysis versus LLM-driven approaches, offering insights into the evolving role of AI in advancing software security practices.
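For context, the misuses these benchmarks contain are patterns like the one sketched below: selecting a broken algorithm or mode through Java's JCA `Cipher.getInstance` API. This is an illustrative example of the misuse class, not a case drawn from the datasets themselves; the specific benchmark programs vary.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

public class CryptoMisuseDemo {
    public static void main(String[] args) throws Exception {
        // MISUSE: DES has a 56-bit key (brute-forceable) and ECB mode leaks
        // plaintext patterns. Static analyzers such as CryptoGuard flag this
        // transformation string at the getInstance call site.
        SecretKey weakKey = KeyGenerator.getInstance("DES").generateKey();
        Cipher weak = Cipher.getInstance("DES/ECB/PKCS5Padding");
        weak.init(Cipher.ENCRYPT_MODE, weakKey);

        // FIX: AES with an authenticated mode (GCM); a fresh random IV is
        // supplied via GCMParameterSpec at init time (omitted here).
        Cipher strong = Cipher.getInstance("AES/GCM/NoPadding");

        System.out.println(weak.getAlgorithm() + " -> " + strong.getAlgorithm());
    }
}
```

A detection tool's task on such code is to report the `DES/ECB` call as a misuse while not flagging the `AES/GCM` replacement.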