Recent advancements in open-source code large language models (LLMs) have demonstrated remarkable coding abilities achieved by instruction tuning on data generated by powerful closed-source LLMs such as GPT-3.5 and GPT-4. This paper explores how to further improve an instruction-tuned code LLM by generating data from the model itself rather than by querying closed-source LLMs. Our key observation is an asymmetry between translating formal and informal languages: translating formal language (i.e., code) into informal language (i.e., natural language) is more straightforward than the reverse. Based on this observation, we propose INVERSE-INSTRUCT, which summarizes instructions from code snippets instead of generating code from instructions. Specifically, given an instruction-tuning corpus for code and the resulting instruction-tuned code LLM, we ask the code LLM to generate additional high-quality instructions for the original corpus through code summarization and self-evaluation. We then fine-tune the base LLM on the combination of the original corpus and the self-generated one, which yields a stronger instruction-tuned LLM. We present a series of code LLMs named InverseCoder, which surpasses the performance of the original code LLMs on a wide range of benchmarks, including Python text-to-code generation, multilingual coding, and data-science code generation.
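The data-generation loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `model_summarize`, `model_score`, and the filtering threshold are hypothetical stand-ins for calls to the instruction-tuned code LLM.

```python
# Hypothetical sketch of the INVERSE-INSTRUCT data-generation loop.
# The two model_* functions stand in for calls to the instruction-tuned
# code LLM; their signatures and behavior are illustrative assumptions.

def model_summarize(code: str) -> list[str]:
    # Placeholder: the code LLM would produce candidate natural-language
    # instructions summarizing the given code snippet.
    return [f"Write code that does the following: {code[:40]}"]

def model_score(instruction: str, code: str) -> float:
    # Placeholder: the code LLM self-evaluates how well the candidate
    # instruction matches the code snippet.
    return 1.0

def inverse_instruct(corpus: list[dict], threshold: float = 0.5) -> list[dict]:
    """Given (instruction, code) pairs, build an augmented corpus by
    summarizing each code snippet back into candidate instructions and
    keeping only those that pass self-evaluation."""
    new_pairs = []
    for pair in corpus:
        for candidate in model_summarize(pair["code"]):
            if model_score(candidate, pair["code"]) >= threshold:
                new_pairs.append({"instruction": candidate,
                                  "code": pair["code"]})
    # The base LLM is then fine-tuned on the original corpus plus the
    # self-generated pairs.
    return corpus + new_pairs
```

In an actual pipeline, the placeholder functions would be replaced by prompted generations from the instruction-tuned model, and the combined corpus would be fed to standard supervised fine-tuning of the base model.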