While the automated detection of cryptographic API misuses has progressed significantly, its precision diminishes for intricate targets due to the reliance on manually defined patterns. Large Language Models (LLMs), renowned for their contextual understanding, offer a promising avenue for addressing these shortcomings. However, applying LLMs in this security-critical domain presents challenges, particularly the unreliability stemming from LLMs' stochastic nature and the well-known issue of hallucination. To explore the prevalence of LLMs' unreliable analysis and potential solutions, this paper introduces a systematic evaluation framework for assessing LLMs in detecting cryptographic misuses, utilizing a comprehensive dataset encompassing both manually crafted samples and real-world projects. Our in-depth analysis of 11,940 LLM-generated reports shows that the inherent instability of LLMs can lead to over half of the reports being false positives. Nevertheless, we demonstrate how a constrained problem scope, coupled with LLMs' self-correction capability, significantly enhances detection reliability. The optimized approach achieves a detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks. Moreover, we identify the failure patterns that persistently hinder LLMs' reliability, including both cryptographic knowledge deficiency and code semantics misinterpretation. Guided by these insights, we develop an LLM-based workflow to examine open-source repositories, leading to the discovery of 63 real-world cryptographic misuses. Of these, 46 have been acknowledged by the development community, with 23 currently being addressed and 6 resolved. Reflecting on developers' feedback, we offer recommendations for future research and the development of LLM-based security tools.
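To make the notion of a "cryptographic API misuse" concrete, the sketch below illustrates two classic misuse classes that pattern-based detectors commonly target. These examples are illustrative only and are not drawn from the paper's dataset or benchmarks; the function names are hypothetical.

```python
import hashlib
import random
import secrets

# Misuse class 1: using a broken hash function (MD5) for a
# security-sensitive purpose such as password hashing (CWE-327).
def hash_password_insecure(password: str) -> str:
    return hashlib.md5(password.encode()).hexdigest()

# Misuse class 2: deriving key material from a non-cryptographic
# PRNG; `random` is predictable and unsuitable for keys (CWE-338).
def generate_key_insecure(nbytes: int = 16) -> bytes:
    return bytes(random.randrange(256) for _ in range(nbytes))

# A corrected counterpart: `secrets` draws from the OS CSPRNG.
def generate_key_secure(nbytes: int = 16) -> bytes:
    return secrets.token_bytes(nbytes)
```

Rule-based tools flag such patterns by matching known API names and argument values, which is precisely where precision degrades on intricate targets and where the contextual understanding of LLMs is hypothesized to help.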