Step-by-step reasoning approaches such as chain of thought (CoT) have proved very effective at inducing reasoning capabilities in large language models. However, the success of CoT is fundamentally tied to model size, and models at the scale of billions of parameters are often needed for CoT to work. In this paper, we propose a knowledge distillation approach that leverages the step-by-step CoT reasoning capabilities of larger models and distills these abilities into smaller models. We also propose an alternative reasoning scheme, Socratic CoT, that learns a decomposition of the original problem into a sequence of subproblems and uses it to guide the intermediate reasoning steps. We use Socratic CoT to train a combination of two small distilled models: a problem decomposer and a subproblem solver. In practice, given a new problem, the two distilled models work in sync to decompose and solve it. On multiple reasoning datasets (GSM8K, StrategyQA, and SVAMP), our proposed distillation strategies boost the performance of smaller models by over 70% compared to the baselines. Finally, we investigate when Socratic CoT is an effective alternative to CoT, demonstrating cases where a much smaller model (GPT-2 large) can outperform a 10X larger model (GPT-3 6B). Our code is available here: https://github.com/kumar-shridhar/Distiiling-LM
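The interplay between the problem decomposer and the subproblem solver can be illustrated with a minimal sketch. It is not taken from the released code: the names `socratic_cot`, `toy_decompose`, and `toy_solve` are hypothetical, the two "models" are hard-coded stand-ins rather than distilled language models, and passing earlier subquestion-answer pairs to the solver is an assumption about how the intermediate steps are chained.

```python
from typing import Callable, List, Tuple

# One (subquestion, answer) pair per intermediate reasoning step.
Step = Tuple[str, str]


def socratic_cot(
    problem: str,
    decompose: Callable[[str], List[str]],
    solve: Callable[[str, List[Step], str], str],
) -> List[Step]:
    """Decompose `problem` into subquestions, then answer them in order."""
    steps: List[Step] = []
    for subquestion in decompose(problem):
        # Assumption: the solver sees the problem and all earlier steps.
        answer = solve(problem, steps, subquestion)
        steps.append((subquestion, answer))
    return steps


# Hypothetical stand-in for the distilled problem decomposer (hard-coded).
def toy_decompose(problem: str) -> List[str]:
    return [
        "How many apples does Ann have after buying more?",
        "How many apples are left after she eats some?",
    ]


# Hypothetical stand-in for the distilled subproblem solver (hard-coded).
def toy_solve(problem: str, steps: List[Step], subquestion: str) -> str:
    canned = ["3 + 5 = 8", "8 - 2 = 6"]
    return canned[len(steps)]


problem = "Ann has 3 apples, buys 5 more, then eats 2. How many are left?"
steps = socratic_cot(problem, toy_decompose, toy_solve)

# The final answer is read off the last intermediate step.
final_answer = steps[-1][1].split("=")[-1].strip()
```

In a real setting, `decompose` and `solve` would each wrap a call to a small fine-tuned model; keeping them as plain callables makes it easy to swap either one independently.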