Recent advancements in Large Language Models (LLMs) and their utilization in code generation tasks have significantly reshaped the field of software development. Despite the remarkable efficacy of code completion solutions in mainstream programming languages, their performance lags when applied to less ubiquitous formats such as OpenAPI definitions. This study evaluates the OpenAPI completion performance of GitHub Copilot, a prevalent commercial code completion tool, and proposes a set of task-specific optimizations leveraging Meta's open-source model Code Llama. This research proposes a semantics-aware OpenAPI completion benchmark and uses it in a series of experiments analyzing the impact of various prompt-engineering and fine-tuning techniques on the Code Llama model's performance. The fine-tuned Code Llama model reaches a peak correctness improvement of 55.2% over GitHub Copilot despite utilizing 25 times fewer parameters than the commercial solution's underlying Codex model. Additionally, this research proposes an enhancement to a widely used code infilling training technique, addressing the issue of underperformance when the model is prompted with context sizes smaller than those used during training.