We introduce Magicoder, a series of fully open-source (code, weights, and data) Large Language Models (LLMs) for code that significantly closes the gap with top code models while having no more than 7B parameters. Magicoder models are trained on 75K synthetic instruction examples generated with OSS-Instruct, a novel approach that enlightens LLMs with open-source code snippets so that they generate high-quality instruction data for code. Our main motivation is to mitigate the inherent bias of LLM-generated synthetic data by giving the models a wealth of open-source references, so that the data they produce is more diverse, realistic, and controllable. Because OSS-Instruct is orthogonal to other data generation methods such as Evol-Instruct, the two can be combined to build an enhanced MagicoderS. Both Magicoder and MagicoderS substantially outperform state-of-the-art code models of similar or even larger size on a wide range of coding benchmarks, including Python text-to-code generation, multilingual coding, and data-science program completion. Notably, MagicoderS-CL-7B, based on CodeLlama, even surpasses the prominent ChatGPT on HumanEval+ (66.5 vs. 65.9 in pass@1). Overall, OSS-Instruct opens a new direction for low-bias and high-quality instruction tuning using abundant open-source references.
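To make the idea of seeding an LLM with open-source code concrete, the following is a minimal sketch of how a seed snippet might be sampled from a source file and wrapped in a generation prompt. It is an illustration only: the function names `extract_seed` and `build_prompt`, the line-count bounds, and the wording of `PROMPT_TEMPLATE` are all assumptions and are not taken from the paper. The call to the LLM itself is left out.

```python
import random

# Hypothetical prompt wording (an assumption, not the paper's actual prompt).
PROMPT_TEMPLATE = (
    "Gain inspiration from the following code snippet and create a "
    "self-contained coding problem together with a correct solution.\n\n"
    "Code snippet:\n{snippet}\n"
)


def extract_seed(source, min_lines=1, max_lines=15, rng=None):
    """Return a random run of consecutive lines from an open-source file.

    The bounds are illustrative defaults, not values from the paper.
    """
    rng = rng or random.Random()
    lines = source.splitlines()
    if not lines:
        raise ValueError("source is empty")
    # Clamp both bounds to the file length so short files still work.
    low = min(min_lines, len(lines))
    high = min(max_lines, len(lines))
    length = rng.randint(low, high)
    start = rng.randint(0, len(lines) - length)
    return "\n".join(lines[start:start + length])


def build_prompt(snippet):
    """Wrap a seed snippet in the (hypothetical) instruction-generation prompt."""
    return PROMPT_TEMPLATE.format(snippet=snippet)


if __name__ == "__main__":
    source = "a = 1\nb = 2\nc = a + b\nprint(c)"
    seed = extract_seed(source, min_lines=2, max_lines=3, rng=random.Random(0))
    print(build_prompt(seed))
```

Sampling the seed at random from real code, rather than asking the model to invent a task from nothing, is what ties each generated problem to a distinct open-source reference. The resulting prompt would then be sent to a teacher LLM, and its response stored as one instruction example.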