Large Language Models for Code (LLMs4Code) have demonstrated outstanding performance in the software engineering domain, particularly in coding tasks. However, even the most advanced LLMs4Code inevitably contain incorrect or outdated code knowledge. Given the high cost of training LLMs4Code, retraining the models to fix such problematic knowledge is impractical. Model editing is an emerging field that aims to effectively and efficiently correct erroneous knowledge in LLMs, and various model editing techniques and benchmarks have been proposed recently. Nevertheless, a comprehensive study that thoroughly compares and analyzes how state-of-the-art model editing techniques adapt the knowledge within LLMs4Code across various code-related tasks is notably absent. To bridge this gap, we perform the first systematic study that applies state-of-the-art model editing approaches to repair inaccuracies in LLMs4Code. To that end, we introduce a benchmark named CLMEEval, which consists of two datasets: CoNaLa-Edit (CNLE), with 21K+ code generation samples, and CodeSearchNet-Edit (CSNE), with 16K+ code summarization samples. With CLMEEval, we evaluate six advanced model editing techniques on three LLMs4Code: CodeLlama (7B), CodeQwen1.5 (7B), and Stable-Code (3B). Our findings show that the external memorization-based GRACE approach achieves the best knowledge-editing effectiveness and specificity (the editing does not influence untargeted knowledge), while generalization (whether an edit generalizes to other semantically identical inputs) remains a universal challenge for existing techniques. Furthermore, building on an in-depth case analysis, we introduce an enhanced version of GRACE, named A-GRACE, which incorporates contrastive learning to better capture the semantics of the inputs.
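To make the external-memorization idea behind GRACE concrete, the following is a minimal sketch (not the paper's implementation) of a codebook-style adaptor: edits are stored as key-value pairs, where a key is a hidden representation of the edited input and a value is the corrected output representation, each guarded by a deferral radius. At inference time, if the current hidden state falls within some key's radius, the stored value is substituted; otherwise the model's original computation is left untouched, which is what preserves specificity. The class and method names here are illustrative assumptions.

```python
import numpy as np

class GraceCodebook:
    """Illustrative external-memory adaptor: stores (key, value, radius) edits."""

    def __init__(self):
        self.keys = []     # hidden states of edited inputs
        self.values = []   # corrected representations/outputs
        self.radii = []    # deferral radius per edit

    def add_edit(self, key, value, radius):
        # Register one edit without modifying any model weights.
        self.keys.append(np.asarray(key, dtype=float))
        self.values.append(value)
        self.radii.append(float(radius))

    def lookup(self, hidden):
        # Return the stored value if the hidden state lands inside the
        # deferral radius of its nearest key; otherwise return None,
        # meaning "defer to the unedited model".
        hidden = np.asarray(hidden, dtype=float)
        best_value, best_dist = None, np.inf
        for key, value, radius in zip(self.keys, self.values, self.radii):
            dist = np.linalg.norm(hidden - key)
            if dist <= radius and dist < best_dist:
                best_value, best_dist = value, dist
        return best_value

# Usage: one edit is stored; a nearby query is intercepted,
# a distant (untargeted) query falls through to the base model.
book = GraceCodebook()
book.add_edit(key=[1.0, 0.0], value="corrected_output", radius=0.5)
print(book.lookup([0.9, 0.1]))   # within radius -> edit applies
print(book.lookup([0.0, 1.0]))   # far away -> None, model unchanged
```

The generalization weakness reported in the study maps directly onto this sketch: a semantically identical input whose hidden state lands outside every radius is not intercepted, which motivates A-GRACE's contrastive learning to pull paraphrases of an edited input closer to its stored key.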