Recent work has shown that fine-tuning large pre-trained language models on a collection of tasks described via instructions, known as instruction-tuning, improves their zero- and few-shot generalization to unseen tasks. However, there is limited understanding of the performance trade-offs of different decisions made during the instruction-tuning process. These decisions include the scale and diversity of the instruction-tuning benchmark, different task sampling strategies, fine-tuning with and without demonstrations, training using specialized datasets for reasoning and dialogue, and finally, the fine-tuning objectives themselves. In this paper, we characterize the effect of instruction-tuning decisions on downstream task performance when scaling both model and benchmark sizes. To this end, we create OPT-IML Bench: a large benchmark for Instruction Meta-Learning (IML) of 2000 NLP tasks consolidated into task categories from 8 existing benchmarks, and prepare an evaluation framework to measure three types of model generalization: to tasks from fully held-out categories, to held-out tasks from seen categories, and to held-out instances from seen tasks. Through the lens of this framework, we first present insights about instruction-tuning decisions as applied to OPT-30B and further exploit these insights to train OPT-IML 30B and 175B, which are instruction-tuned versions of OPT. OPT-IML demonstrates all three generalization abilities at both scales on four different evaluation benchmarks with diverse tasks and input formats: PromptSource, FLAN, Super-NaturalInstructions, and UnifiedSKG. It not only significantly outperforms OPT on all benchmarks but is also highly competitive with existing models fine-tuned on each specific benchmark. We release OPT-IML at both scales, together with the OPT-IML Bench evaluation framework.
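To make the three generalization levels concrete, the following is a minimal sketch of how examples could be partitioned into a training split and three evaluation splits. It is an illustration under stated assumptions, not the procedure used to build OPT-IML Bench: the function name `split_three_levels`, the `category`/`task`/`text` record schema, and the rule of holding out every k-th instance of a seen task are all hypothetical.

```python
from collections import defaultdict


def split_three_levels(examples, held_out_categories, held_out_tasks,
                       instance_holdout_every=5):
    """Partition examples into train plus three evaluation splits.

    Hypothetical schema: each example is a dict with 'category', 'task',
    and 'text' keys. This is an illustrative sketch, not the actual
    OPT-IML Bench construction.
    """
    splits = defaultdict(list)
    seen_per_task = defaultdict(int)
    for ex in examples:
        if ex["category"] in held_out_categories:
            # Level 1: the entire task category is unseen during training.
            splits["held_out_category"].append(ex)
        elif ex["task"] in held_out_tasks:
            # Level 2: the category is seen, but this task is not.
            splits["held_out_task"].append(ex)
        else:
            # Level 3: the task is seen; reserve every k-th instance
            # (an assumed rule) for evaluation and train on the rest.
            seen_per_task[ex["task"]] += 1
            if seen_per_task[ex["task"]] % instance_holdout_every == 0:
                splits["held_out_instance"].append(ex)
            else:
                splits["train"].append(ex)
    return dict(splits)


# Toy data with placeholder category and task names.
examples = (
    [{"category": "cat_a", "task": "task_1", "text": f"x{i}"} for i in range(5)]
    + [{"category": "cat_a", "task": "task_2", "text": "y0"}]
    + [{"category": "cat_b", "task": "task_3", "text": "z0"}]
)
splits = split_three_levels(examples, {"cat_b"}, {"task_2"})
for name, items in sorted(splits.items()):
    print(name, len(items))
```

In this sketch the category check takes precedence over the task check, so an example from a held-out category never leaks into the other evaluation splits or into training.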