Instruction tuning enables language models to generalize more effectively and to follow user intent more closely. However, obtaining instruction data is costly and challenging. Prior work relies on expensive human annotation, on crowd-sourced datasets with alignment issues, or on noisy examples generated by LLMs. We introduce the LongForm dataset, which is created by augmenting English corpus examples with instructions. We select a diverse set of human-written documents from existing corpora such as C4 and Wikipedia and generate instructions for these documents via LLMs. This approach yields a cheaper and cleaner instruction-tuning dataset that is also suitable for long text generation. We finetune T5, OPT, and LLaMA models on our dataset and show that even smaller LongForm models generalize well for text generation. Our models outperform 10x larger language models without instruction tuning on various tasks such as story/recipe generation and long-form question answering. Moreover, LongForm models outperform prior instruction-tuned models such as FLAN-T5 and Alpaca by a large margin. Finally, our models can effectively follow and answer multilingual instructions; we demonstrate this for news generation. We publicly release our data and models: https://github.com/akoksal/LongForm.
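The data construction described above (take a human-written document, then ask an LLM for an instruction that the document would answer) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the released pipeline: the prompt wording in `PROMPT_TEMPLATE`, the helper `build_instruction_pairs`, and the placeholder `stub_llm` are all hypothetical names introduced here, and any real LLM client would have to be supplied by the reader as the `generate` callable.

```python
from typing import Callable, Dict, List

# Hypothetical prompt wording; the abstract does not give the actual template.
PROMPT_TEMPLATE = (
    "Write an instruction for which the following text would be a good answer.\n\n"
    "Text:\n{document}\n\nInstruction:"
)


def build_instruction_pairs(
    documents: List[str],
    generate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Pair each human-written document with an LLM-generated instruction.

    `generate` is any function mapping a prompt string to a completion string
    (an assumption; plug in whichever LLM client is available).
    """
    pairs = []
    for doc in documents:
        prompt = PROMPT_TEMPLATE.format(document=doc)
        instruction = generate(prompt).strip()
        if not instruction:
            # Skip documents for which the model returned nothing usable.
            continue
        # The human-written document is kept unchanged as the target output.
        pairs.append({"instruction": instruction, "output": doc})
    return pairs


def stub_llm(prompt: str) -> str:
    """Stand-in for a real LLM call, used only to make the sketch runnable."""
    return "Describe the topic of the given text."
```

Calling `build_instruction_pairs(["A recipe for lentil soup."], stub_llm)` returns one record whose `output` is the original document. The point of the design is that only the instruction is machine-generated, so the target text remains clean, human-written, and as long as the source document.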