Semantic understanding of programs has attracted considerable attention in the community. Inspired by the recent success of large language models (LLMs) in natural language understanding, substantial progress has been made by treating programming languages as another kind of natural language and training LLMs on corpora of program code. However, programs differ fundamentally from text in that they are heavily structured and syntactically strict. In particular, programs and their basic units (i.e., functions and subroutines) are designed to exhibit a variety of behaviors and/or produce possible outputs given different inputs. The relationship between inputs and possible outputs/behaviors characterizes the functions/subroutines and profiles the program as a whole. We therefore propose to incorporate this relationship into learning in order to achieve a deeper semantic understanding of programs. To obtain inputs that are representative enough to trigger the execution of most of the code, we resort to fuzz testing and propose fuzz tuning, which boosts the performance of a pre-trained LLM on program understanding and code representation learning. The effectiveness of the proposed method is verified on two program understanding tasks, code clone detection and code classification, where it outperforms the current state of the art by large margins. Code is available at https://github.com/rabbitjy/FuzzTuning.
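To make the idea of pairing code with its input/output behavior concrete, the following is a minimal sketch and not the implementation from the repository above. It assumes a naive random-input fuzzer in place of a real coverage-guided one, and the names `fuzz_io_pairs`, `build_model_input`, and `reciprocal_sign`, as well as the comment-style serialization of the pairs, are hypothetical choices made only for illustration.

```python
import random


def fuzz_io_pairs(func, num_cases=5, seed=0, low=-100, high=100):
    """Run `func` on random integer inputs and record (input, output) pairs.

    Assumption: a naive random fuzzer stands in for a coverage-guided one.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_cases):
        x = rng.randint(low, high)
        try:
            y = repr(func(x))
        except Exception as exc:
            # A crash is also observable behavior, so record the exception type.
            y = f"<{type(exc).__name__}>"
        pairs.append((x, y))
    return pairs


def build_model_input(source, pairs):
    """Append the observed pairs to the source text (hypothetical format)."""
    lines = [source.strip()]
    for x, y in pairs:
        lines.append(f"# input: {x!r} -> output: {y}")
    return "\n".join(lines)


# Toy program under test, kept both as text and as a callable.
SOURCE = '''
def reciprocal_sign(x):
    return 1 if 1 / x > 0 else -1
'''


def reciprocal_sign(x):
    return 1 if 1 / x > 0 else -1


pairs = fuzz_io_pairs(reciprocal_sign, num_cases=5, seed=0)
text = build_model_input(SOURCE, pairs)
print(text)
```

The resulting `text` is the kind of combined sequence (source code plus observed behavior) that could be fed to a pre-trained model during fine-tuning. How the pairs are actually encoded and which fuzzer generates them are left open here.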