ToolCoder: Teach Code Generation Models to use API search tools

Automatically generating source code from natural language descriptions has been a growing field of research in recent years. However, current large-scale code generation models often have difficulty selecting appropriate APIs for a specific context. These models may generate APIs that do not meet the requirements or refer to non-existent APIs in third-party libraries, especially for lesser-known or private libraries. Inspired by the way human developers use tools to search for APIs, we propose ToolCoder, a novel approach that integrates API search tools with existing models to assist in code generation and API selection. To teach our model to use tools, we introduce an automated data annotation method that uses ChatGPT to add tool usage information to source code data, and we then fine-tune code generation models on the annotated data. During inference, we integrate API search tools into the generation process so that our model can automatically use the search tool to get suggestions when selecting an API. Our experimental results demonstrate that ToolCoder exhibits excellent performance and generalization across five public and private library code generation benchmarks, with at least a 6.21% improvement in average pass@1 and a 9.64% improvement in average pass@10 compared to state-of-the-art methods. Furthermore, we show that our relatively small ToolCoder model performs comparably to GPT-3.5, one of the current best models, highlighting the potential of incorporating programming tools into the code generation process.
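The inference-time idea described above, where generation pauses at an API choice, a search tool is queried, and its suggestion is fed back into the context, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `<search>`/`<result>` markers, the keyword-overlap `search_api` function, the tiny `API_INDEX`, and the scripted stand-in for the model are all hypothetical and are not taken from the paper.

```python
import re
from typing import Callable, Dict

# Hypothetical markup for a tool call; the abstract does not specify a format.
TOOL_CALL = re.compile(r"<search>(.*?)</search>")

# Hypothetical API documentation index used by the toy search tool.
API_INDEX: Dict[str, str] = {
    "np.sum": "sum of array elements over a given axis",
    "np.sort": "return a sorted copy of an array",
    "np.mean": "compute the arithmetic mean along the specified axis",
}


def search_api(query: str, index: Dict[str, str]) -> str:
    """Toy search tool: return the API whose description shares the most words with the query."""
    query_words = set(query.lower().split())
    best_name, best_score = "", 0
    for name, description in index.items():
        score = len(query_words & set(description.lower().split()))
        if score > best_score:
            best_name, best_score = name, score
    return best_name


def generate_with_tool(step: Callable[[str], str], prompt: str,
                       index: Dict[str, str], max_steps: int = 20) -> str:
    """Decode chunk by chunk; when a chunk contains a tool call, run the search and append the result."""
    text = prompt
    for _ in range(max_steps):
        chunk = step(text)
        if chunk == "":  # the stand-in model signals it is done
            break
        text += chunk
        match = TOOL_CALL.search(chunk)
        if match:
            text += "<result>" + search_api(match.group(1), index) + "</result>"
    return text


def strip_tool_markup(text: str) -> str:
    """Remove the search/result annotations so only the generated code remains."""
    return re.sub(r"<search>.*?</search><result>.*?</result>", "", text)


def scripted_model(text: str) -> str:
    """Stand-in for a fine-tuned model: asks the tool once, then uses its suggestion."""
    if "<search>" not in text:
        return "<search>sum of array elements</search>"
    if text.endswith("</result>"):
        suggestion = re.search(r"<result>(.*?)</result>", text).group(1)
        return f"\ntotal = {suggestion}(values)"
    return ""


if __name__ == "__main__":
    raw = generate_with_tool(scripted_model, "# add up all values\n", API_INDEX)
    print(strip_tool_markup(raw))
```

In this sketch the decoding loop, not the model, executes the search, so the search backend could be swapped (for example, a documentation index for a private library) without changing the model interface.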

