Recent code large language models (LLMs) perform well at generating standalone functions but struggle with repository-level code generation: because they are unaware of repository-level dependencies (e.g., user-defined attributes), they produce dependency errors such as undefined-variable and no-member errors. In this work, we introduce ToolGen, an approach that integrates autocompletion tools into the code LLM generation process to resolve these dependencies. ToolGen comprises two phases: Data Augmentation and Model Fine-tuning (Offline), and Tool-integrated Code Generation (Online). In the offline phase, ToolGen augments functions in a given code corpus with a special mark token that indicates the positions at which to trigger the autocompletion tool. These augmented functions, together with their docstrings, are then used to fine-tune a selected code LLM. In the online phase, ToolGen generates a function iteratively, predicting tokens step by step with the fine-tuned LLM. Whenever the model predicts a mark token, ToolGen invokes the autocompletion tool to suggest code completions and selects the most appropriate one. We conduct comprehensive experiments to evaluate ToolGen's effectiveness in repository-level code generation. To support this evaluation, we create a benchmark comprising 680 real-world code repositories and introduce two new repository-level metrics: Dependency Coverage and Success Rate. The results show that ToolGen improves dependency coverage by 15.2% to 45.8% and success rate by 10.9% to 42.2% across three distinct code LLMs, while remaining competitive on widely recognized similarity metrics. Furthermore, our generalizability evaluation confirms that ToolGen performs consistently when applied to diverse code LLMs of different architectures and scales.
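The online phase described above can be illustrated with a minimal sketch. It is not ToolGen's actual implementation: the mark token string `<COMP>`, the end-of-sequence string `<EOS>`, and the three callables (`next_token`, `suggest`, `choose`) are hypothetical stand-ins for the fine-tuned LLM, the autocompletion tool, and the candidate-selection step, none of which are specified in this abstract.

```python
from typing import Callable, List

MARK = "<COMP>"  # hypothetical mark token; the real token string is an assumption
EOS = "<EOS>"    # hypothetical end-of-sequence token


def generate_with_tool(
    next_token: Callable[[List[str]], str],         # stand-in for the fine-tuned LLM
    suggest: Callable[[List[str]], List[str]],      # stand-in for the autocompletion tool
    choose: Callable[[List[str], List[str]], str],  # stand-in for candidate selection
    prompt: List[str],
    max_steps: int = 256,
) -> List[str]:
    """Decode token by token, deferring to the tool whenever the mark appears."""
    tokens = list(prompt)
    for _ in range(max_steps):
        tok = next_token(tokens)
        if tok == EOS:
            break
        if tok == MARK:
            # The mark itself is never emitted; it only triggers the tool.
            candidates = suggest(tokens)
            if candidates:
                tokens.append(choose(tokens, candidates))
            # With no candidates, this sketch just moves on to the next step.
            continue
        tokens.append(tok)
    return tokens


# Toy demonstration with a scripted "model" and a fixed candidate list.
script = iter(["return", " ", "self", ".", MARK, "(", ")", EOS])


def toy_model(_context: List[str]) -> str:
    return next(script)


def toy_tool(_context: List[str]) -> List[str]:
    return ["load_config", "save_config"]


def toy_choose(_context: List[str], candidates: List[str]) -> str:
    return candidates[0]


result = generate_with_tool(
    toy_model, toy_tool, toy_choose,
    ["def", " ", "f", "(", "self", ")", ":", " "],
)
print("".join(result))  # def f(self): return self.load_config()
```

In this toy run the scripted model emits the mark right after `self.`, so the member name comes from the tool's candidate list rather than from the model. This is the mechanism by which an undefined member would be avoided.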