Deploying large language models (LLMs) is challenging because they are memory-inefficient and compute-intensive for practical applications. In response, researchers train smaller task-specific models by either finetuning with human labels or distilling using LLM-generated labels. However, finetuning and distillation require large amounts of training data to achieve performance comparable to LLMs. We introduce Distilling step-by-step, a new mechanism that (a) trains smaller models that outperform LLMs, and (b) does so using less training data than finetuning or distillation requires. Our method extracts LLM rationales as additional supervision for training small models within a multi-task framework. We present three findings across four NLP benchmarks. First, compared to both finetuning and distillation, our mechanism achieves better performance with far fewer labeled or unlabeled training examples. Second, compared to few-shot prompted LLMs, we achieve better performance using substantially smaller models. Third, we reduce both the model size and the amount of data required to outperform LLMs: our 770M-parameter T5 model outperforms the 540B-parameter PaLM model using only 80% of the available data on a benchmark task.
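The multi-task setup mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each training example is expanded into two text-to-text pairs, one asking the small model for the label and one asking for the LLM rationale, and that the two losses are combined as a weighted sum. The `Example` class, the `[label]` and `[rationale]` prefixes, the function names, and the `rationale_weight` parameter are all assumptions made for illustration; the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Example:
    text: str       # task input
    label: str      # target answer (human label or LLM-generated label)
    rationale: str  # LLM-generated explanation supporting the label


def build_multitask_pairs(
    ex: Example,
    label_prefix: str = "[label]",          # hypothetical task prefix
    rationale_prefix: str = "[rationale]",  # hypothetical task prefix
) -> List[Tuple[str, str]]:
    """Expand one example into (input, target) pairs, one per task."""
    return [
        (f"{label_prefix} {ex.text}", ex.label),
        (f"{rationale_prefix} {ex.text}", ex.rationale),
    ]


def combined_loss(
    label_loss: float,
    rationale_loss: float,
    rationale_weight: float = 1.0,  # assumed weighting hyperparameter
) -> float:
    """Weighted sum of the label-prediction and rationale-generation losses."""
    return label_loss + rationale_weight * rationale_loss


if __name__ == "__main__":
    ex = Example(
        text="Is a whale a fish?",
        label="no",
        rationale="Whales are mammals: they breathe air and nurse their young.",
    )
    for model_input, target in build_multitask_pairs(ex):
        print(model_input, "->", target)
```

In this sketch the rationale acts only as a training-time target, so a model trained this way could still be queried with the label prefix alone at inference time.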