Instruction tuning has emerged as a way to improve the ability of large language models (LLMs) to produce appropriate outputs for input instructions. However, existing methods for collecting instruction-tuning data are limited in scalability and affordability. In this paper, we propose Dynosaur, a dynamic growth paradigm for instruction-tuning data curation. Building on the metadata of existing NLP datasets, we use LLMs to generate multiple task instructions applicable to each dataset and to determine which data fields are relevant for constructing the instruction-tuning data. Dynosaur offers several advantages: 1) lower generation cost (less than $12 to generate 800K instruction-tuning examples), 2) good data quality (better performance than Alpaca and Instruction GPT-4 on Super-NI with comparable data sizes), and 3) the ability to grow dynamically by incorporating new datasets from the Huggingface Datasets Platform. We further investigate continual learning as an approach to learning from the ever-growing instruction-tuning dataset. We demonstrate that replay methods not only mitigate forgetting but also improve generalization to unseen tasks. In this novel continual learning scenario for instruction tuning, selecting replay tasks based on instruction representations can be an effective strategy. Code and data are released at \url{https://github.com/WadeYin9712/Dynosaur}.
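To make the data-construction step concrete, the following is a minimal sketch of how an instruction paired with designated input and output fields could turn the records of an existing dataset into instruction-tuning examples. It is not the released implementation: the function name `build_examples`, the example field names, and the output keys are hypothetical, and the instruction and field choices, which Dynosaur obtains from an LLM given dataset metadata, are hard-coded here.

```python
from typing import Dict, List


def build_examples(
    records: List[Dict[str, str]],
    instruction: str,
    input_fields: List[str],
    output_field: str,
) -> List[Dict[str, str]]:
    """Pair one task instruction with each record's chosen fields.

    Hypothetical helper: in the paper's setting, `instruction`,
    `input_fields`, and `output_field` would come from an LLM that
    reads the dataset metadata; here they are supplied by hand.
    """
    examples = []
    for record in records:
        # Skip records that lack any of the designated fields.
        needed = input_fields + [output_field]
        if any(field not in record for field in needed):
            continue
        # Concatenate the input fields as "name: value" lines.
        input_text = "\n".join(f"{f}: {record[f]}" for f in input_fields)
        examples.append(
            {
                "instruction": instruction,
                "input": input_text,
                "output": str(record[output_field]),
            }
        )
    return examples


# Toy records standing in for an existing NLP dataset (assumed fields).
records = [
    {"title": "Great phone", "text": "Battery lasts two days.", "label": "positive"},
    {"title": "Broken on arrival", "label": "negative"},  # no "text" field
]

examples = build_examples(
    records,
    instruction="Classify the sentiment of the review.",
    input_fields=["title", "text"],
    output_field="label",
)
```

Because several instructions can apply to the same dataset, calling the helper again with a different instruction and field selection (for example, predicting `title` from `text`) would yield a second task from the same records.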