Few-shot fine-tuning and in-context learning are two alternative strategies for task adaptation of pre-trained language models. Recently, in-context learning has gained popularity over fine-tuning due to its simplicity and improved out-of-domain generalization, and because extensive evidence shows that fine-tuned models pick up on spurious correlations. Unfortunately, previous comparisons of the two approaches were done using models of different sizes. This raises the question of whether the observed weaker out-of-domain generalization of fine-tuned models is an inherent property of fine-tuning or a limitation of the experimental setup. In this paper, we compare the generalization of few-shot fine-tuning and in-context learning to challenge datasets, while controlling for the models used, the number of examples, and the number of parameters, ranging from 125M to 30B. Our results show that fine-tuned language models can in fact generalize well out-of-domain. We find that both approaches generalize similarly; they exhibit large variation and depend on properties such as model size and the number of examples, highlighting that robust task adaptation remains a challenge.