We introduce Magicoder, a series of fully open-source (code, weights, and data) Large Language Models (LLMs) for code that significantly close the gap with top code models while having no more than 7B parameters. Magicoder models are trained on 75K synthetic instruction data generated with OSS-Instruct, a novel approach that enlightens LLMs with open-source code snippets to produce diverse instruction data for code. Our main motivation is to mitigate the inherent bias of LLM-generated synthetic data by drawing on the wealth of open-source references to produce more realistic and controllable data. The orthogonality of OSS-Instruct to other data generation methods such as Evol-Instruct further enables us to build an enhanced MagicoderS. Both Magicoder and MagicoderS substantially outperform state-of-the-art code models of similar or even larger size on a wide range of coding benchmarks. Notably, MagicoderS-CL-7B, based on CodeLlama, even surpasses the prominent ChatGPT on HumanEval+ (66.5 vs. 65.9 in pass@1). Overall, OSS-Instruct opens a new direction for crafting diverse synthetic instruction data for code from abundant open-source references.
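To make the OSS-Instruct idea concrete, the following is a minimal, hypothetical sketch of the data-generation loop it describes: a real open-source code fragment seeds a prompt asking a teacher LLM to invent a programming problem and solution inspired by that fragment. The prompt wording and helper name below are illustrative assumptions, not the paper's exact template.

```python
# Hypothetical sketch of OSS-Instruct-style prompt construction.
# The template text and function name are illustrative, not the
# paper's verbatim prompt.

def build_oss_instruct_prompt(seed_snippet: str) -> str:
    """Wrap an open-source code fragment in an instruction-generation prompt."""
    return (
        "Gain inspiration from the following random code snippet to create "
        "a high-quality, self-contained programming problem and its solution.\n\n"
        "Code snippet for inspiration:\n"
        "```\n" + seed_snippet + "\n```\n\n"
        "Present your answer as a [Problem Description] followed by a [Solution]."
    )

# A seed snippet mined from an open-source repository (toy example here).
seed = "def parse_csv(line):\n    return line.strip().split(',')"
prompt = build_oss_instruct_prompt(seed)

# In the actual pipeline, `prompt` would be sent to a teacher model and the
# returned problem/solution pair collected into the 75K instruction corpus.
print(prompt)
```

The key property this sketch illustrates is that diversity comes from the seed snippets themselves rather than from the LLM's priors, which is what the abstract credits with reducing the bias of purely model-generated instruction data.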