Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models

Haoran Li, Qingxiu Dong, Zhengyang Tang, Chaojun Wang, Xingxing Zhang, Haoyang Huang, Shaohan Huang, Xiaolong Huang, Zeqiang Huang, Dongdong Zhang, Yuxian Gu, Xin Cheng, Xun Wang, Si-Qing Chen, Li Dong, Wei Lu, Zhifang Sui, Benyou Wang, Wai Lam, Furu Wei

From arXiv. Work in progress.

We introduce Generalized Instruction Tuning (GLAN), a general and scalable method for instruction tuning of Large Language Models (LLMs). Unlike prior work that relies on seed examples or existing datasets to construct instruction tuning data, GLAN takes only a pre-curated taxonomy of human knowledge and capabilities as input and generates large-scale synthetic instruction data across all disciplines. Specifically, inspired by the systematic structure of the human education system, we build the taxonomy semi-automatically, with the help of LLMs, by decomposing human knowledge and capabilities into fields, sub-fields, and ultimately distinct disciplines. We then generate a comprehensive list of subjects for every discipline and design a syllabus tailored to each subject, again using LLMs. With the fine-grained key concepts detailed in every class session of the syllabus, we are able to generate diverse instructions with broad coverage across the entire spectrum of human knowledge and skills. Extensive experiments on large language models (e.g., Mistral) demonstrate that GLAN excels in multiple dimensions, ranging from mathematical reasoning, coding, academic exams, and logical reasoning to general instruction following, without using task-specific training data for any of these tasks. In addition, GLAN allows for easy customization: new fields or skills can be added by simply incorporating a new node into the taxonomy.
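The pipeline described above (taxonomy, then subjects, then syllabus sessions, then key concepts, then instructions) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The `LLM` callable, the `stub_llm` stand-in, the prompt strings, the toy taxonomy contents, and the helper names `generate_instructions` and `add_discipline` are all assumptions made for illustration; a real system would replace `stub_llm` with calls to an actual model and would likely use richer prompts and sampling.

```python
from typing import Callable, Dict, List

# Hypothetical LLM interface (an assumption): prompt in, list of strings out.
LLM = Callable[[str], List[str]]

# Toy taxonomy: field -> sub-field -> disciplines. The contents are made up.
TAXONOMY: Dict[str, Dict[str, List[str]]] = {
    "Natural Sciences": {"Physical Sciences": ["Physics", "Chemistry"]},
}


def stub_llm(prompt: str) -> List[str]:
    """Deterministic stand-in for a real model: always returns two items."""
    return [f"{prompt} [{i}]" for i in (1, 2)]


def generate_instructions(
    taxonomy: Dict[str, Dict[str, List[str]]], llm: LLM
) -> List[Dict[str, str]]:
    """Walk the taxonomy and expand each discipline into instructions."""
    records = []
    for field, sub_fields in taxonomy.items():
        for sub_field, disciplines in sub_fields.items():
            for discipline in disciplines:
                # Discipline -> subjects -> syllabus sessions -> key concepts.
                for subject in llm(f"subject of {discipline}"):
                    for session in llm(f"session of {subject}"):
                        for concept in llm(f"key concept of {session}"):
                            records.append({
                                "field": field,
                                "sub_field": sub_field,
                                "discipline": discipline,
                                "subject": subject,
                                "session": session,
                                "concept": concept,
                                # Placeholder instruction template (assumed).
                                "instruction": f"Write a question that tests: {concept}",
                            })
    return records


def add_discipline(
    taxonomy: Dict[str, Dict[str, List[str]]],
    field: str,
    sub_field: str,
    discipline: str,
) -> None:
    """Customize the taxonomy in place by adding one new node."""
    taxonomy.setdefault(field, {}).setdefault(sub_field, []).append(discipline)


if __name__ == "__main__":
    records = generate_instructions(TAXONOMY, stub_llm)
    print(len(records), records[0]["instruction"])
```

Because the taxonomy is the only input, adding a node with `add_discipline` and rerunning `generate_instructions` is enough to extend coverage in this sketch, which mirrors the customization property claimed in the abstract.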
