Instruction tuning is a supervised fine-tuning approach that significantly improves the ability of large language models (LLMs) to follow human instructions. We propose SelfCodeAlign, the first fully transparent and permissive pipeline for self-aligning code LLMs without extensive human annotations or distillation. SelfCodeAlign employs the same base model for inference throughout the data generation process. It first extracts diverse coding concepts from high-quality seed snippets to generate new tasks. It then samples multiple responses per task, pairs each with test cases, and validates them in a sandbox environment. Finally, passing examples are selected for instruction tuning. In our primary experiments, we use SelfCodeAlign with CodeQwen1.5-7B to generate a dataset of 74k instruction-response pairs. Finetuning on this dataset leads to a model that achieves a 67.1 pass@1 on HumanEval+, surpassing CodeLlama-70B-Instruct despite being ten times smaller. Across all benchmarks, this finetuned model consistently outperforms the original version trained with OctoPack, the previous state-of-the-art method for instruction tuning without human annotations or distillation. Additionally, we show that SelfCodeAlign is effective across LLMs of various sizes, from 3B to 33B, and that the base models can benefit more from alignment with their own data distribution. We further validate each component's effectiveness in our pipeline, showing that SelfCodeAlign outperforms both direct distillation from GPT-4o and leading GPT-3.5-based distillation methods, such as OSS-Instruct and Evol-Instruct. SelfCodeAlign has also led to the creation of StarCoder2-Instruct, the first fully transparent, permissively licensed, and self-aligned code LLM that achieves state-of-the-art coding performance.
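The execution-based filtering step described above (pairing each sampled response with generated test cases and keeping only the pairs that pass in a sandbox) can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function names, the candidate-dict fields, and the use of a plain subprocess as the "sandbox" are all assumptions for exposition.

```python
# Minimal sketch of SelfCodeAlign's execution-based selection step:
# each candidate (instruction, response, tests) triple is kept only if
# the response passes its paired tests when run in an isolated process.
# Names and data layout here are illustrative, not the paper's API.
import subprocess
import sys
import tempfile

def passes_in_sandbox(response_code: str, test_code: str, timeout: float = 10.0) -> bool:
    """Run the model's response together with its generated tests in a
    separate Python process; exit code 0 means all tests passed."""
    program = response_code + "\n\n" + test_code
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        # Hanging or non-terminating responses are treated as failures.
        return False

def select_passing(candidates):
    """Keep only the instruction-response pairs whose responses pass
    their own generated test cases."""
    return [
        (c["instruction"], c["response"])
        for c in candidates
        if passes_in_sandbox(c["response"], c["tests"])
    ]
```

In the real pipeline the sandbox would additionally restrict filesystem and network access; a bare subprocess with a timeout is used here only to keep the sketch self-contained.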