Large language models (LLMs) are increasingly used to synthesize and reason about source code. However, a model's knowledge is static, while the libraries and API functions it invokes evolve continuously, with functionality being added or changed. While numerous benchmarks evaluate how well LLMs can generate code, no prior work has studied how an LLM's knowledge of code API functions can be updated. To fill this gap, we present CodeUpdateArena, a benchmark for knowledge editing in the code domain. An instance in our benchmark consists of a synthetic API function update paired with a program synthesis example that uses the updated functionality; our goal is to update an LLM so that it can solve this program synthesis example without being given documentation of the update at inference time. Compared to knowledge editing for facts encoded in text, success here is more challenging: a code LLM must correctly reason about the semantics of the modified function rather than just reproduce its syntax. We construct our dataset by first prompting GPT-4 to generate atomic, executable function updates. Then, for each update, we generate program synthesis examples whose code solutions are likely to use the update. Our benchmark covers updates of various types to 54 functions from seven diverse Python packages, for a total of 670 program synthesis examples. Our experiments show that prepending documentation of the update to open-source code LLMs (i.e., DeepSeek, CodeLlama) does not allow them to incorporate the changes for problem solving, and that existing knowledge editing techniques also leave substantial room for improvement. We hope our benchmark will inspire new methods for knowledge updating in code LLMs.
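To make the structure of a benchmark instance concrete, the following is a purely hypothetical illustration (not drawn from the actual dataset): a synthetic update that adds a new `unique` keyword argument to an invented function `flatten`, paired with a program synthesis example whose reference solution depends on the updated behavior. All names here are assumptions for illustration only.

```python
# Hypothetical CodeUpdateArena-style instance (illustrative, not real data).

# --- Synthetic API update: `flatten` gains a `unique` keyword argument ---
def flatten(nested, unique=False):
    """Flatten a list of lists.

    Update: with unique=True, duplicates are dropped while preserving
    first-occurrence order.
    """
    flat = [x for sub in nested for x in sub]
    if unique:
        seen, deduped = set(), []
        for x in flat:
            if x not in seen:
                seen.add(x)
                deduped.append(x)
        return deduped
    return flat

# --- Program synthesis example: a correct solution must invoke the
# --- updated functionality, not just the pre-update API.
def merge_tag_lists(tag_lists):
    """Return all tags across lists, deduplicated, in first-seen order."""
    return flatten(tag_lists, unique=True)

print(merge_tag_lists([["a", "b"], ["b", "c"]]))  # ['a', 'b', 'c']
```

An edited model is then judged on whether it can produce such a solution with no documentation of the update in its prompt, which requires it to have internalized the new semantics rather than merely its surface form.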