We introduce Bonito, an open-source model for conditional task generation: the task of converting unannotated text into task-specific training datasets for instruction tuning. Our goal is to enable zero-shot task adaptation of large language models on users' specialized, private data. We train Bonito on a new large-scale dataset of 1.65M examples created by remixing existing instruction tuning datasets into meta-templates. The meta-templates for a dataset produce training examples in which the input is the unannotated text together with the task attribute, and the output consists of the instruction and the response. We use Bonito to generate synthetic tasks for seven datasets from specialized domains across three task types (yes-no question answering, extractive question answering, and natural language inference) and use these tasks to adapt language models. We show that Bonito significantly improves the average performance of pretrained and instruction-tuned models over the de facto self-supervised baseline. For example, adapting Mistral-Instruct-v2 and instruction-tuned variants of Mistral and Llama2 with Bonito improves the strong zero-shot performance by 22.1 F1 points, whereas the next-word prediction objective undoes some of the benefits of instruction tuning and reduces the average performance by 0.8 F1 points. We conduct additional experiments with Bonito to understand the effects of the domain, the size of the training set, and the choice of alternative synthetic task generators. Overall, we show that learning with synthetic instruction tuning datasets is an effective way to adapt language models to new domains. The model, dataset, and code are available at https://github.com/BatsResearch/bonito.
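To make the meta-template idea concrete, the following is a minimal sketch of how one example from an existing instruction tuning dataset could be remixed into a conditional task generation example. It is an illustration under stated assumptions, not the released implementation: the `SourceExample` schema, the `to_meta_example` helper, and the `Task type:` / `Context:` / `Instruction:` / `Response:` labels are all hypothetical names chosen here for readability.

```python
from dataclasses import dataclass


@dataclass
class SourceExample:
    """One example from an existing instruction tuning dataset (hypothetical schema)."""
    context: str      # the passage the task is grounded in
    instruction: str  # e.g. a question about the passage
    response: str     # the reference answer
    task_type: str    # the task attribute, e.g. "extractive question answering"


def to_meta_example(ex: SourceExample) -> dict:
    """Remix a source example into a conditional task generation example.

    Input  = unannotated text + task attribute.
    Output = instruction + response.
    The field labels below are illustrative assumptions, not the paper's format.
    """
    model_input = f"Task type: {ex.task_type}\nContext: {ex.context}"
    model_output = f"Instruction: {ex.instruction}\nResponse: {ex.response}"
    return {"input": model_input, "output": model_output}


example = SourceExample(
    context="Bonito is trained on 1.65M examples.",
    instruction="How many examples is Bonito trained on?",
    response="1.65M",
    task_type="extractive question answering",
)
meta = to_meta_example(example)
print(meta["input"])
print(meta["output"])
```

A task generator trained on such pairs would, at adaptation time, receive only the unannotated text and a chosen task attribute, and produce the instruction and response that form the synthetic training example.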