Large language models (LLMs) such as Codex and GPT-4 have recently shown remarkable code generation abilities, substantially improving coding efficiency. This paper investigates using LLMs to generate code for private libraries, which are widely used in everyday programming. Despite their capabilities, LLMs struggle to generate code that invokes private APIs because they have had no exposure to these libraries during pre-training. To address this challenge, we propose a framework that emulates how programmers write private code. The framework comprises two modules: APIFinder first retrieves potentially useful APIs from the API documentation, and APICoder then uses the retrieved APIs to generate private code. Specifically, APIFinder employs vector retrieval techniques and allows the user to take part in the retrieval process, while APICoder can directly use off-the-shelf code generation models. To further strengthen the model's ability to invoke APIs given in the prompt, we continually pre-train an enhanced version of APICoder, named CodeGenAPI. Our goal is to train both modules on a large collection of public libraries so that they generalize to private ones. We also create four private library benchmarks, TorchDataEval, TorchDataComplexEval, MonkeyEval, and BeatNumEval, and handcraft test cases for each to support comprehensive evaluation. Extensive experiments on the four benchmarks consistently demonstrate the effectiveness of our approach, and further analysis provides additional insights.
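The retrieve-then-generate flow described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes a toy bag-of-words embedding in place of a trained dense retriever, and the library `privlib`, its three API entries, and the function names `embed`, `cosine`, `find_apis`, and `build_prompt` are all hypothetical. The sketch stops at prompt construction; passing the prompt to a code generation model is left out.

```python
import math
import re
from collections import Counter

# Hypothetical documentation for an imaginary private library "privlib".
API_DOCS = {
    "privlib.load_table(path)": "load a table from a csv file path",
    "privlib.drop_missing(table)": "remove rows with missing values from a table",
    "privlib.plot_hist(column)": "plot a histogram of a numeric column",
}


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a dense encoder)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_apis(query, docs, top_k=2):
    """Retrieval step: rank API signatures by similarity of their docs to the query."""
    q = embed(query)
    ranked = sorted(docs, key=lambda sig: cosine(q, embed(docs[sig])), reverse=True)
    return ranked[:top_k]


def build_prompt(query, apis, docs):
    """Generation step input: list the retrieved APIs as comments ahead of the task."""
    lines = ["# Potentially useful APIs:"]
    lines += [f"# {sig}: {docs[sig]}" for sig in apis]
    lines.append(f"# Task: {query}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    task = "remove rows with missing values"
    retrieved = find_apis(task, API_DOCS, top_k=1)
    print(build_prompt(task, retrieved, API_DOCS))
```

In this sketch the user involvement mentioned in the abstract would correspond to editing the list returned by `find_apis` before calling `build_prompt`; how the paper realizes that step is not specified here.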