Large Language Models (LLMs) have shown an enhanced ability to solve novel tasks by reasoning step by step, known as Chain-of-Thought (CoT) reasoning. How can we instill the same step-by-step reasoning capability on unseen tasks in LMs with fewer than 100B parameters? To address this question, we first introduce the CoT Collection, a new instruction-tuning dataset augmented with 1.88 million CoT rationales across 1,060 tasks. We show that continually fine-tuning Flan-T5 (3B and 11B) on the CoT Collection enables both models to perform CoT better on unseen tasks, improving average zero-shot accuracy on 27 datasets of the BIG-Bench-Hard benchmark by +4.34% and +2.44%, respectively. Furthermore, we show that instruction tuning with CoT gives LMs stronger few-shot learning capabilities, yielding improvements of +2.97% and +2.37% on 4 domain-specific tasks over Flan-T5 (3B and 11B), respectively. We make our CoT Collection data and our trained models publicly available at https://github.com/kaist-lklab/CoT-Collection.