While automated detection of cryptographic API misuses has progressed significantly, its precision diminishes on intricate targets due to reliance on manually defined patterns. Large Language Models (LLMs) offer promising context-aware understanding to address this shortcoming, yet their stochastic nature and tendency to hallucinate pose challenges to their application in precise security analysis. This paper presents the first systematic study of applying LLMs to cryptographic API misuse detection. Our findings are noteworthy: the instability of directly applying LLMs results in over half of the initial reports being false positives. Despite this, the reliability of LLM-based detection can be significantly enhanced by aligning detection scopes with realistic scenarios and employing a novel code-and-analysis validation technique, achieving nearly 90% detection recall. This improvement substantially surpasses traditional methods and leads to the discovery of previously unknown vulnerabilities in established benchmarks. Nevertheless, we identify recurring failure patterns that illustrate current LLMs' blind spots. Leveraging these findings, we deploy an LLM-based detection system and uncover 63 new vulnerabilities (47 confirmed, 7 already fixed) in open-source Java and Python repositories, including prominent projects such as Apache.
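The abstract itself contains no code; as an illustration of the kind of cryptographic API misuse such detectors target, the following minimal Python sketch (function names are hypothetical, not from the paper) contrasts a classic misuse, an unsalted fast hash for password storage, with a safer stdlib alternative:

```python
import hashlib
import os

def store_password_insecure(password: str) -> str:
    # Misuse pattern flagged by crypto-API detectors: MD5 is a broken,
    # fast digest, and no salt is used, so identical passwords collide.
    return hashlib.md5(password.encode()).hexdigest()

def store_password_safer(password: str) -> str:
    # Safer counterpart: a random salt plus a slow key-derivation
    # function (PBKDF2-HMAC-SHA256 with a high iteration count).
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000)
    return salt.hex() + ":" + digest.hex()
```

Rule-based tools typically match the algorithm name ("md5") against a fixed pattern list, whereas an LLM-based detector can additionally judge from context whether the digest guards a security-sensitive value.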