Large Language Models for Code (LLMs4Code) have demonstrated outstanding performance in the software engineering domain, especially in coding tasks. However, even the most advanced LLMs4Code inevitably contain incorrect or outdated code knowledge. Given the high cost of training LLMs4Code, it is impractical to re-train the models to fix such problematic knowledge. Model editing is an emerging technical field for effectively and efficiently correcting erroneous knowledge in LLMs, and various model editing techniques and benchmarks have been proposed recently. Nevertheless, a comprehensive study that thoroughly compares and analyzes the performance of state-of-the-art model editing techniques for adapting the knowledge within LLMs4Code across various code-related tasks is notably absent. To bridge this gap, we perform the first systematic study on applying state-of-the-art model editing approaches to repair inaccuracies in LLMs4Code. To that end, we introduce a benchmark named CLMEEval, which consists of two datasets: CoNaLa-Edit (CNLE) with 21K+ code generation samples and CodeSearchNet-Edit (CSNE) with 16K+ code summarization samples. With the help of CLMEEval, we evaluate six advanced model editing techniques on three LLMs4Code: CodeLlama (7B), CodeQwen1.5 (7B), and Stable-Code (3B). Our findings include that the external memorization-based GRACE approach achieves the best knowledge editing effectiveness and specificity (the editing does not influence untargeted knowledge), while generalization (whether the editing can generalize to other semantically-identical inputs) remains a universal challenge for existing techniques. Furthermore, building on in-depth case analysis, we introduce an enhanced version of GRACE called A-GRACE, which incorporates contrastive learning to better capture the semantics of the inputs.
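To make the external memorization idea concrete, the following is a minimal sketch (not the paper's implementation; all names and thresholds are illustrative) of a GRACE-style codebook: edits are cached as (key, value) pairs over hidden activations, and a deferral radius decides whether an edit fires or the original activation passes through unchanged, which is what preserves specificity.

```python
import numpy as np


class GraceStyleCodebook:
    """Illustrative external key-value memory in the spirit of GRACE.

    Edits are stored as (key, value) pairs of hidden activations; at
    inference, an input activation within `radius` of a cached key is
    replaced by the stored corrected value, otherwise it is returned
    untouched (so untargeted knowledge is unaffected).
    """

    def __init__(self, radius: float = 0.5):
        self.keys = []    # activations of the edited inputs
        self.values = []  # replacement activations encoding each fix
        self.radius = radius

    def add_edit(self, key, value):
        self.keys.append(np.asarray(key, dtype=float))
        self.values.append(np.asarray(value, dtype=float))

    def __call__(self, hidden):
        hidden = np.asarray(hidden, dtype=float)
        if not self.keys:
            return hidden
        dists = [np.linalg.norm(hidden - k) for k in self.keys]
        i = int(np.argmin(dists))
        # Fire the edit only inside the deferral radius.
        return self.values[i] if dists[i] <= self.radius else hidden
```

The generalization weakness reported in the study follows directly from this design: a semantically equivalent rephrasing that lands outside the deferral radius never triggers the edit, which is the gap A-GRACE's contrastive encoding of inputs is meant to narrow.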