Instruction tuning has emerged as a way to enhance the ability of large language models (LLMs) to comprehend instructions and generate appropriate responses. Existing methods either annotate data manually or employ LLMs (e.g., the GPT series) to generate data for instruction tuning. However, they often overlook associating instructions with existing annotated datasets. In this paper, we propose Dynosaur, a dynamic growth paradigm for the automatic curation of instruction-tuning data. Based on the metadata of existing datasets, we use LLMs to automatically construct instruction-tuning data by identifying relevant data fields and generating appropriate instructions. By leveraging existing annotated datasets, Dynosaur offers several advantages: 1) it reduces the API cost of generating instructions (e.g., calling GPT-3.5-turbo to generate 800K instruction-tuning samples costs less than $12); 2) it provides high-quality data for instruction tuning (e.g., it performs better than Alpaca and Flan on Super-NI and Longform with comparable data sizes); and 3) it supports the continuous improvement of models by generating instruction-tuning data whenever a new annotated dataset becomes available. We further investigate a continual learning scheme for learning with the ever-growing instruction-tuning dataset, and demonstrate that replaying tasks with diverse instruction embeddings not only helps mitigate forgetting but also generalizes better to unseen tasks. Code and data are available at https://github.com/WadeYin9712/Dynosaur.
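To make the data-construction step concrete, the following is a minimal sketch of how a task proposed from dataset metadata could be turned into instruction-tuning examples. It is an illustration under stated assumptions, not the Dynosaur implementation: the `TaskSpec` structure, the `build_examples` helper, the field names, and the toy records are all hypothetical, and the task specification that an LLM would generate from the metadata is hard-coded here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskSpec:
    """Hypothetical shape of one LLM-proposed task for a dataset."""
    instruction: str          # natural-language instruction for the task
    input_fields: List[str]   # dataset fields that form the model input
    output_field: str         # dataset field that holds the target output


def build_examples(records: List[Dict[str, str]], spec: TaskSpec) -> List[Dict[str, str]]:
    """Turn annotated records into instruction-tuning examples for one task."""
    examples = []
    needed = spec.input_fields + [spec.output_field]
    for record in records:
        # Skip records that lack any field the task depends on.
        if any(name not in record for name in needed):
            continue
        input_text = "\n".join(f"{name}: {record[name]}" for name in spec.input_fields)
        examples.append({
            "instruction": spec.instruction,
            "input": input_text,
            "output": record[spec.output_field],
        })
    return examples


# Toy annotated dataset, made up for illustration.
records = [
    {"title": "Dune", "review": "A sweeping epic.", "rating": "5"},
    {"title": "Untitled", "review": "Hard to follow."},  # no rating, so it is skipped
]

# Assumption: in the described pipeline an LLM would propose this spec from the
# dataset's metadata (description and field names); here it is written by hand.
spec = TaskSpec(
    instruction="Predict the rating (1-5) the reviewer gave the book.",
    input_fields=["title", "review"],
    output_field="rating",
)

examples = build_examples(records, spec)
print(examples)
```

Because the outputs come from the existing annotations, the LLM in such a setup would only need to be called once per dataset to propose tasks rather than once per sample, which is consistent with the low API cost reported above.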