In this paper, we conduct a thorough investigation into the reasoning capabilities of Large Language Models (LLMs), focusing specifically on the Open Pretrained Transformers (OPT) models as representatives of this class. Our study entails finetuning three different sizes of OPT on a carefully curated reasoning corpus, resulting in two sets of finetuned models: OPT-R, finetuned without explanations, and OPT-RE, finetuned with explanations. We then evaluate all models on 57 out-of-domain tasks drawn from the SUPER-NATURALINSTRUCTIONS benchmark, covering 26 distinct reasoning skills, using three prompting techniques. Through a comprehensive grid of 27 configurations and 6,156 test evaluations, we investigate the dimensions of finetuning, prompting, and scale to understand the role of explanations in different reasoning skills. Our findings reveal that including explanations in the few-shot exemplars has no significant impact on performance when the model is finetuned, whereas it improves the performance of the non-finetuned counterpart. Moreover, we observe a slight yet consistent increase in classification accuracy as explanations are incorporated during prompting and during finetuning. Finally, we offer insights into which skills benefit the most from incorporating explanations during finetuning and prompting, such as Numerical (+20.4%) and Analogical (+13.9%) reasoning, as well as which skills exhibit negligible or negative effects.
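The grid of 27 configurations can be read as a Cartesian product over the three dimensions named above. The sketch below is one such reading, not the authors' code: it assumes three model variants (the non-finetuned OPT, OPT-R, and OPT-RE) crossed with three sizes and three prompting techniques. The size and prompting labels are placeholders, since the abstract does not name them. Note that 27 configurations times 57 tasks gives 1,539, and the reported 6,156 evaluations is four times that; the abstract does not say what accounts for the remaining factor, so the sketch leaves it unmodeled.

```python
from itertools import product

# Model variants named in the abstract (treating the non-finetuned
# baseline as the third variant is an assumption).
MODEL_VARIANTS = ["OPT", "OPT-R", "OPT-RE"]

# Placeholder labels: the abstract only says there are three sizes
# and three prompting techniques, without naming them.
MODEL_SIZES = ["size_1", "size_2", "size_3"]
PROMPTING_TECHNIQUES = ["prompt_1", "prompt_2", "prompt_3"]

NUM_TASKS = 57  # out-of-domain tasks, as stated in the abstract


def build_grid():
    """Enumerate every (variant, size, prompting) combination."""
    return [
        {"variant": v, "size": s, "prompting": p}
        for v, s, p in product(MODEL_VARIANTS, MODEL_SIZES, PROMPTING_TECHNIQUES)
    ]


if __name__ == "__main__":
    grid = build_grid()
    print(f"configurations: {len(grid)}")
    print(f"configuration-task pairs: {len(grid) * NUM_TASKS}")
```

Enumerating the grid this way makes each of the three study dimensions (finetuning, prompting, scale) an explicit axis, so results can be grouped along any one of them while holding the other two fixed.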