Recent advancements in Large Language Models (LLMs) and their application to code generation tasks have significantly reshaped software development. Despite the remarkable efficacy of code completion solutions for mainstream programming languages, their performance lags on less ubiquitous formats such as OpenAPI definitions. This study evaluates the OpenAPI completion performance of GitHub Copilot, a widely used commercial code completion tool, and proposes a set of task-specific optimizations leveraging Meta's open-source model Code Llama. A semantics-aware OpenAPI completion benchmark proposed in this research is used to perform a series of experiments analyzing the impact of various prompt-engineering and fine-tuning techniques on the Code Llama model's performance. The fine-tuned Code Llama model reaches a peak correctness improvement of 55.2% over GitHub Copilot despite using 25 times fewer parameters than the commercial solution's underlying Codex model. Additionally, this research proposes an enhancement to a widely used code infilling training technique, addressing the issue of underperformance when the model is prompted with context sizes smaller than those used during training. The dataset, the benchmark, and the model fine-tuning code are made publicly available.