Recent advancements in open-source code large language models (LLMs) have been driven by fine-tuning on data generated by powerful closed-source LLMs, which is expensive to obtain. This paper explores whether it is possible to use a fine-tuned open-source model to generate additional data to augment its own instruction-tuning dataset. We make two observations: (1) a code snippet can serve as the response to different instructions; (2) instruction-tuned code LLMs perform better at translating code into instructions than the reverse. Based on these observations, we propose Inverse-Instruct, a data augmentation technique that uses a fine-tuned LLM to generate additional instructions for the code responses in its own training dataset. The additional instruction-response pairs are added to the original dataset, and a stronger code LLM can be obtained by fine-tuning on the augmented dataset. We empirically validate Inverse-Instruct on a range of open-source code models (e.g., CodeLlama-Python and DeepSeek-Coder) and benchmarks (e.g., HumanEval(+), MBPP(+), DS-1000, and MultiPL-E), showing that it consistently improves the base models.
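The augmentation loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `summarize_code` is a hypothetical stand-in for prompting the fine-tuned LLM to translate a code response back into a natural-language instruction.

```python
def summarize_code(code: str) -> str:
    """Hypothetical stub for the code-to-instruction step: in practice,
    the fine-tuned LLM would be prompted to describe the code response."""
    first_line = code.splitlines()[0]
    return f"Write code that starts by defining: {first_line}"

def inverse_instruct(dataset: list) -> list:
    """Augment an instruction-tuning dataset with new (instruction, response)
    pairs obtained by generating instructions from existing code responses."""
    augmented = list(dataset)  # keep all original pairs
    for pair in dataset:
        new_instruction = summarize_code(pair["response"])
        # A code snippet can answer more than one instruction, so the
        # generated pair is added alongside the original one.
        augmented.append({"instruction": new_instruction,
                          "response": pair["response"]})
    return augmented

data = [{"instruction": "Return the square of x.",
         "response": "def square(x):\n    return x * x"}]
augmented = inverse_instruct(data)  # 2 pairs sharing one code response
```

The augmented dataset would then be used to fine-tune the base model again, yielding the stronger model the abstract reports.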