Recent code large language models (LLMs) have shown promising performance in generating standalone functions, but they struggle with repository-level code generation because they are unaware of repository-level dependencies (e.g., user-defined attributes), which leads to dependency errors such as undefined-variable and no-member errors. In this work, we introduce ToolGen, an approach that integrates autocompletion tools into the code LLM generation process to address these dependencies. ToolGen comprises two main phases: Trigger Insertion and Model Fine-tuning (Offline), and Tool-integrated Code Generation (Online). During the offline phase, ToolGen augments functions within a given code corpus with a special mark token that indicates the positions at which autocompletion tools should be triggered. These augmented functions, along with their corresponding docstrings, are then used to fine-tune a selected code LLM. In the online phase, ToolGen generates functions iteratively, predicting tokens step by step with the fine-tuned LLM. Whenever a mark token is encountered, ToolGen invokes the autocompletion tool to suggest code completions and selects the most appropriate one. We conduct comprehensive experiments to evaluate ToolGen's effectiveness in repository-level code generation. To facilitate this evaluation, we create a benchmark comprising 680 real-world code repositories and introduce two new repository-level metrics: Dependency Coverage and Static Validity Rate. The results demonstrate that ToolGen significantly improves Dependency Coverage by 15.2% to 45.8% and Static Validity Rate by 10.9% to 42.2% across three distinct code LLMs, while maintaining competitive performance on widely recognized similarity metrics. Furthermore, our generalizability evaluation confirms that ToolGen performs consistently when applied to diverse code LLMs, including various model architectures and scales.
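The two phases described above can be illustrated with a minimal Python sketch. It is not the ToolGen implementation: the mark token spelling `<COMP>`, the "token after a dot" trigger rule, the scripted stand-in for the fine-tuned LLM, the fixed suggestion list standing in for a real autocompletion tool, and the docstring-matching selection rule are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence

MARK = "<COMP>"  # hypothetical spelling of the special mark token
EOS = "<EOS>"    # hypothetical end-of-sequence token


def insert_triggers(tokens: Sequence[str],
                    is_trigger: Callable[[Sequence[str], int], bool]) -> List[str]:
    """Offline phase (sketch): place MARK before every token at which
    the autocompletion tool should be triggered."""
    out: List[str] = []
    for i, tok in enumerate(tokens):
        if is_trigger(tokens, i):
            out.append(MARK)
        out.append(tok)
    return out


def after_dot(tokens: Sequence[str], i: int) -> bool:
    """Toy trigger rule (assumption): the token right after a '.'."""
    return i > 0 and tokens[i - 1] == "."


def generate(next_token, complete, choose, docstring: str,
             max_steps: int = 50) -> List[str]:
    """Online phase (sketch): decode token by token; when the model emits
    MARK, ask the tool for candidates and append the chosen one."""
    out: List[str] = []
    for _ in range(max_steps):
        tok = next_token(docstring, out)
        if tok == EOS:
            break
        if tok == MARK:
            candidates = complete(out)
            if candidates:
                out.append(choose(docstring, out, candidates))
            continue  # the mark itself is never kept in the output
        out.append(tok)
    return out


# --- Toy stand-ins (all hypothetical) ---
_script = iter(["return", "self", ".", MARK, "+", "1", EOS])


def scripted_model(docstring: str, generated: List[str]) -> str:
    """Stands in for the fine-tuned LLM: replays a fixed token sequence."""
    return next(_script)


def toy_complete(generated: List[str]) -> List[str]:
    """Stands in for the autocompletion tool: members valid at this point."""
    return ["name", "_count"]


def toy_choose(docstring: str, generated: List[str],
               candidates: List[str]) -> str:
    """Pick the first candidate mentioned in the docstring, else the first."""
    for cand in candidates:
        if cand.strip("_") in docstring.lower():
            return cand
    return candidates[0]


augmented = insert_triggers(["return", "self", ".", "_count", "+", "1"], after_dot)
result = generate(scripted_model, toy_complete, toy_choose,
                  "Return the count plus one.")
print(augmented)
print(result)
```

In this sketch the model only decides where a repository-level identifier is needed, and the tool restricts the choice to names that actually exist at that position, which is the mechanism the abstract credits for reducing undefined-variable and no-member errors.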