Entity-Inconsistency Bugs (EIBs) are a class of semantic bugs in which a syntactically valid but incorrect program entity, such as a variable identifier or function name, is misused; they often have security implications. Unlike straightforward syntactic vulnerabilities, EIBs are subtle and can remain undetected for years. Traditional detection methods, such as static analysis and dynamic testing, often fall short because EIBs are varied and context-dependent. With advances in Large Language Models (LLMs) such as GPT-4, we believe LLM-powered automatic EIB detection is becoming increasingly feasible thanks to these models' ability to understand code semantics. This research first systematically measures LLMs' capabilities in detecting EIBs and finds that GPT-4, while promising, has limited recall and precision, which hinders its practical application. The primary problem is the model's tendency to focus on irrelevant code snippets that contain no EIBs. To address this, we introduce a novel cascaded EIB detection system named WitheredLeaf, which uses smaller, code-specific language models to filter out most negative cases, mitigating this problem and significantly improving overall precision and recall. We evaluated WitheredLeaf on 154 Python and C GitHub repositories, each with over 1,000 stars, and identified 123 new flaws, 45% of which can be exploited to disrupt the program's normal operations. Of the 69 fixes we submitted, 27 have been merged.
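The abstract gives no implementation details, so the following is only a minimal sketch of the general cascade idea it describes: a cheap first stage discards likely negatives, and an expensive second stage examines only what remains. It is not the WitheredLeaf implementation. Every name here (`cascade_detect`, `toy_cheap_score`, `toy_expensive_check`, the `threshold` value) is a hypothetical stand-in, and the two toy stages are simple string checks substituting for a small code model and a large LLM.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Finding:
    snippet: str
    suspicion: float  # score assigned by the cheap first stage


def cascade_detect(
    snippets: List[str],
    cheap_score: Callable[[str], float],
    expensive_check: Callable[[str], bool],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> List[Finding]:
    """Run the expensive checker only on snippets the cheap stage flags."""
    findings = []
    for snippet in snippets:
        score = cheap_score(snippet)
        if score < threshold:
            # Likely negative: filtered out, the expensive stage never sees it.
            continue
        if expensive_check(snippet):
            findings.append(Finding(snippet, score))
    return findings


# Toy stand-ins so the sketch runs without any model (both hypothetical).
def toy_cheap_score(snippet: str) -> float:
    # Flags a check that mentions the same limit twice.
    return 0.9 if snippet.count("max_width") >= 2 else 0.1


calls: List[str] = []  # records what reaches the expensive stage


def toy_expensive_check(snippet: str) -> bool:
    calls.append(snippet)
    return "height > max_width" in snippet


snippets = [
    "if width > max_width or height > max_height: reject()",
    # Syntactically valid, but the wrong identifier is used for height:
    "if width > max_width or height > max_width: reject()",
    "total = price * quantity",
]

results = cascade_detect(snippets, toy_cheap_score, toy_expensive_check)
for finding in results:
    print(f"{finding.suspicion:.1f}  {finding.snippet}")
```

In this toy run only the second snippet reaches the expensive stage. That is the point of the design: the large model spends its attention on a small set of suspicious candidates instead of on code that contains no EIBs.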