Code contains security and functional bugs. Identifying and localizing them is difficult and relies on human labor. In this work, we present a novel approach (FLAG) to assist human debuggers. FLAG is based on the lexical capabilities of generative AI, specifically Large Language Models (LLMs). Given a code file as input, we extract each line and have the LLM regenerate it for self-comparison. By comparing the original code with the LLM-generated alternative, we can flag notable differences as anomalies for further inspection, with features such as distance from comments and LLM confidence also aiding this classification. This reduces the inspection search space for the designer. Unlike other automated approaches in this area, FLAG is language-agnostic, can work on incomplete (and even non-compiling) code, and requires no creation of security properties, functional tests, or definition of rules. We explore the features that help LLMs in this classification and evaluate the performance of FLAG on known bugs. We use 121 benchmarks across C, Python, and Verilog, each containing a known security or functional weakness. We conduct the experiments using two state-of-the-art LLMs, OpenAI's code-davinci-002 and gpt-3.5-turbo, but our approach may be used with other models. FLAG can identify 101 of the defects and helps reduce the search space to 12-17% of the source code.
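The regenerate-and-compare idea can be illustrated with a minimal sketch. This is not FLAG's implementation: the function name `flag_lines`, the `regenerate` callback (a stand-in for an LLM call), the use of `difflib` similarity as the difference measure, and the threshold value are all assumptions made for illustration. The abstract's other features, such as distance from comments and LLM confidence, are omitted.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple


def flag_lines(
    source: str,
    regenerate: Callable[[List[str], int], str],
    threshold: float = 0.8,
) -> List[Tuple[int, str, str, float]]:
    """Flag lines whose regenerated version differs notably from the original.

    `regenerate(lines, i)` is a hypothetical callback that returns the
    model's suggestion for line i, given the whole file as context.
    Returns (1-based line number, original, suggestion, similarity).
    """
    lines = source.splitlines()
    flagged = []
    for i, original in enumerate(lines):
        if not original.strip():
            continue  # blank lines carry nothing to compare
        suggestion = regenerate(lines, i)
        # Similarity in [0, 1]; 1.0 means the stripped lines are identical.
        similarity = SequenceMatcher(
            None, original.strip(), suggestion.strip()
        ).ratio()
        if similarity < threshold:
            flagged.append((i + 1, original, suggestion, similarity))
    return flagged


# Toy input with an off-by-one bounds check on line 2.
SOURCE = """def read_item(items, index):
    if index <= len(items):
        return items[index]
    return None"""


def fake_regenerate(lines: List[str], i: int) -> str:
    """Stand-in for an LLM: echoes every line except one canned suggestion."""
    canned = {1: "    if 0 <= index < len(items):"}
    return canned.get(i, lines[i])


FLAGGED = flag_lines(SOURCE, fake_regenerate, threshold=0.95)
for number, original, suggestion, similarity in FLAGGED:
    print(f"line {number}: similarity {similarity:.2f}")
    print(f"  original:   {original.strip()}")
    print(f"  suggestion: {suggestion.strip()}")
```

In this toy run only the bounds check is flagged, so a reviewer would inspect one line out of four. A real setup would replace `fake_regenerate` with a call to a model and would likely need a tuned threshold.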