Large Language Models (LLMs) rely on instruction samples for alignment, but creating these datasets is challenging and costly, particularly for expert-dependent tasks such as coding. One approach to mitigating these challenges is to synthesize data using another LLM. In this paper, we introduce a scalable method for generating synthetic instructions to enhance the code generation capability of LLMs. The proposed algorithm, Genetic-Instruct, mimics evolutionary processes, using self-instruction to create a large number of synthetic samples from a limited set of seeds. Genetic-Instruct is designed to scale the generation process efficiently. Fine-tuning multiple coding LLMs on the synthetic samples yields a significant improvement in their code generation accuracy over the baselines.
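The evolutionary generation loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's actual implementation: `llm_generate` and `passes_fitness` are hypothetical stand-ins for the LLM-based generation and quality-filtering steps.

```python
import random

def llm_generate(prompt):
    # Hypothetical stand-in for an LLM call; Genetic-Instruct would
    # query a real model to produce a new instruction here.
    return "instruction derived from: " + prompt[:40]

def passes_fitness(sample):
    # Placeholder quality filter; a real pipeline would use an
    # LLM-based judge or heuristic checks on the generated sample.
    return len(sample) > 0

def genetic_instruct(seeds, generations=3, offspring_per_gen=4):
    """Sketch of an evolutionary self-instruct loop: starting from a
    small seed pool, repeatedly combine parents to generate offspring
    instructions and keep those that pass the fitness filter."""
    population = list(seeds)
    for _ in range(generations):
        offspring = []
        for _ in range(offspring_per_gen):
            # Crossover/mutation step: prompt the (mocked) LLM with
            # randomly chosen parent instructions from the pool.
            parents = random.sample(population, k=min(2, len(population)))
            child = llm_generate(" | ".join(parents))
            if passes_fitness(child):
                offspring.append(child)
        population.extend(offspring)
    return population

pool = genetic_instruct(["Write a function that reverses a string."])
```

With these mocked components, a single seed grows into a larger pool after a few generations, illustrating how the method scales from few seeds to many synthetic samples.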