We introduce Bonito, an open-source model for conditional task generation that converts unannotated text into task-specific training datasets for instruction tuning. We aim to enable zero-shot task adaptation of large language models on users' specialized, private data. We train Bonito by fine-tuning a pretrained large language model on a new large-scale dataset with 1.65M examples created by remixing existing instruction tuning datasets into meta-templates. The meta-templates for a dataset produce training examples where the input is the unannotated text and the task attribute and the output consists of the instruction and the response. We use Bonito to generate synthetic tasks for seven datasets from specialized domains with unannotated text across three task types -- yes-no question answering, extractive question answering, and natural language inference -- and adapt language models. We show that Bonito significantly improves the average performance of pretrained and instruction-tuned models over the de facto self-supervised baseline. For example, adapting Mistral-Instruct-v2 and instruction-tuned variants of Mistral and Llama2 with Bonito improves their strong zero-shot performance by 22.1 F1 points, whereas the next-word prediction objective undoes some of the benefits of instruction tuning and reduces the average performance by 0.8 F1 points. We conduct additional experiments with Bonito to understand the effects of the domain, the size of the training set, and the choice of alternative synthetic task generators. Overall, we show that learning with synthetic instruction tuning datasets is an effective way to adapt language models to new domains. The model, dataset, and code are available at https://github.com/BatsResearch/bonito.
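To make the meta-template idea concrete, here is a minimal sketch of how an annotated instruction-tuning example might be remixed into a Bonito-style training pair: the model input is the unannotated text plus a task attribute, and the target is the instruction together with its response. The special markers (`<|tasktype|>`, `<|context|>`, `<|task|>`, `<|pipe|>`), the task-type codes, and the function name are illustrative assumptions, not the official Bonito format or API.

```python
# Illustrative sketch (assumed format, not the official Bonito API):
# remix an annotated example into a conditional task generation pair.
# Input side: task attribute + unannotated context.
# Output side: generated instruction + response.

def build_meta_template_example(context: str, task_type: str,
                                instruction: str, response: str) -> dict:
    """Construct one training pair for a conditional task generator.

    task_type is a short attribute code, e.g. "ynqa" (yes-no QA),
    "exqa" (extractive QA), or "nli" (natural language inference).
    These codes and the delimiter tokens are hypothetical.
    """
    model_input = f"<|tasktype|>\n{task_type}\n<|context|>\n{context}"
    model_output = f"<|task|>\n{instruction}\n<|pipe|>\n{response}"
    return {"input": model_input, "output": model_output}

example = build_meta_template_example(
    context="Aspirin irreversibly inhibits the COX-1 enzyme.",
    task_type="exqa",
    instruction="Which enzyme does aspirin inhibit?",
    response="COX-1",
)
print(example["input"])
print(example["output"])
```

At inference time, only the input side (task attribute plus fresh unannotated text from the target domain) would be given to the trained model, which then generates the instruction-response pair used to adapt a downstream language model.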