Current large language models can perform reasonably well on complex tasks that require step-by-step reasoning with few-shot learning. Are these models applying reasoning skills they learned during pre-training and reasoning outside of their training context, or are they simply memorizing their training corpus at a finer granularity and learning to better understand their context? To tease apart these possibilities, we introduce ALERT, a benchmark and suite of analyses for assessing language models' reasoning ability by comparing pre-trained and finetuned models on complex tasks that require reasoning skills to solve. ALERT provides a test bed to assess any language model on fine-grained reasoning skills; it spans over 20 datasets and covers 10 different reasoning skills. We leverage ALERT to further investigate the role of finetuning. Through extensive empirical analysis, we find that language models learn more reasoning skills, such as textual entailment, abductive reasoning, and analogical reasoning, during the finetuning stage than during the pre-training stage. We also find that finetuned language models tend to overfit to the prompt template, which hurts their robustness and causes generalization problems.