OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization

Srinivasan Iyer, Xi Victoria Lin, Ramakanth Pasunuru, Todor Mihaylov, Daniel Simig, Ping Yu, Kurt Shuster, Tianlu Wang, Qing Liu, Punit Singh Koura, Xian Li, Brian O'Horo, Gabriel Pereyra, Jeff Wang, Christopher Dewan, Asli Celikyilmaz, Luke Zettlemoyer, Ves Stoyanov

From arXiv, 56 pages. v2->v3: fixed OPT-30B evaluation results across benchmarks (we previously reported lower performance for this model due to an evaluation pipeline bug).

Recent work has shown that fine-tuning large pre-trained language models on a collection of tasks described via instructions, a.k.a. instruction-tuning, improves their zero- and few-shot generalization to unseen tasks. However, there is limited understanding of the performance trade-offs of the different decisions made during the instruction-tuning process. These decisions include the scale and diversity of the instruction-tuning benchmark, different task sampling strategies, fine-tuning with and without demonstrations, training using specialized datasets for reasoning and dialogue, and finally, the fine-tuning objectives themselves. In this paper, we characterize the effect of instruction-tuning decisions on downstream task performance when scaling both model and benchmark sizes. To this end, we create OPT-IML Bench: a large benchmark for Instruction Meta-Learning (IML) of 2000 NLP tasks consolidated into task categories from 8 existing benchmarks, and prepare an evaluation framework to measure three types of model generalization: to tasks from fully held-out categories, to held-out tasks from seen categories, and to held-out instances from seen tasks. Through the lens of this framework, we first present insights about instruction-tuning decisions as applied to OPT-30B and further exploit these insights to train OPT-IML 30B and 175B, which are instruction-tuned versions of OPT. OPT-IML demonstrates all three generalization abilities at both scales on four different evaluation benchmarks with diverse tasks and input formats: PromptSource, FLAN, Super-NaturalInstructions, and UnifiedSKG. It not only significantly outperforms OPT on all benchmarks but is also highly competitive with existing models fine-tuned on each specific benchmark. We release OPT-IML at both scales, together with the OPT-IML Bench evaluation framework.
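
The three generalization levels named in the abstract can be pictured as a partition of a task collection. The following is a minimal sketch, not the OPT-IML Bench implementation: the function name `split_three_ways`, the record schema (`category`, `task`, `text`), the `instance_frac` parameter, and the toy task names are all assumptions made for illustration.

```python
import random


def split_three_ways(instances, heldout_categories, heldout_tasks,
                     instance_frac=0.1, seed=0):
    """Partition examples into train plus three held-out evaluation sets.

    `instances` is a list of dicts with 'category', 'task', and 'text'
    keys (a hypothetical schema, not the benchmark's actual format).
    """
    rng = random.Random(seed)
    splits = {
        "train": [],
        "heldout_category": [],   # tasks from fully held-out categories
        "heldout_task": [],       # held-out tasks from seen categories
        "heldout_instance": [],   # held-out instances from seen tasks
    }
    for ex in instances:
        if ex["category"] in heldout_categories:
            splits["heldout_category"].append(ex)
        elif ex["task"] in heldout_tasks:
            splits["heldout_task"].append(ex)
        elif rng.random() < instance_frac:
            splits["heldout_instance"].append(ex)
        else:
            splits["train"].append(ex)
    return splits


# Toy data; the category and task labels are illustrative only.
toy = (
    [{"category": "qa", "task": "task_a", "text": f"a{i}"} for i in range(20)]
    + [{"category": "qa", "task": "task_b", "text": f"b{i}"} for i in range(5)]
    + [{"category": "summarization", "task": "task_c", "text": f"c{i}"}
       for i in range(5)]
)
splits = split_three_ways(toy, heldout_categories={"summarization"},
                          heldout_tasks={"task_b"})
```

The category check comes first so that every task in a held-out category is excluded from training, regardless of whether the task itself is also listed as held out.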
