We introduce LM-Guided CoT, a novel framework that leverages a lightweight (i.e., <1B-parameter) language model (LM) to guide a black-box large (i.e., >10B-parameter) LM in reasoning tasks. Specifically, the lightweight LM first generates a rationale for each input instance; the frozen large LM is then prompted to predict the task output based on that rationale. Our approach is resource-efficient in that it requires training only the lightweight LM. We optimize this model through (1) knowledge distillation and (2) reinforcement learning with rationale-oriented and task-oriented reward signals. We evaluate our method on two multi-hop extractive question answering (QA) benchmarks, HotpotQA and 2WikiMultiHopQA. Experimental results show that our approach outperforms all baselines in answer prediction accuracy. We also find that reinforcement learning helps the model produce higher-quality rationales, which in turn improves QA performance.
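To make the two-stage pipeline concrete, the following is a minimal inference sketch in Python. The model names, prompt templates, and the `lm_guided_cot` helper are illustrative assumptions, not the paper's exact configuration; in particular, the frozen >10B black-box LM is stood in by a small open model so the sketch remains runnable.

```python
# Minimal sketch of LM-Guided CoT inference (two-stage pipeline).
# Model names and prompts are hypothetical placeholders.
from transformers import pipeline

# Lightweight (<1B) rationale generator; in the paper this is the only
# trained component (via knowledge distillation and RL).
rationale_lm = pipeline("text2text-generation", model="google/flan-t5-small")

# Frozen black-box answer predictor; a >10B LM in the paper, replaced
# here by a smaller open model purely for illustration.
answer_lm = pipeline("text2text-generation", model="google/flan-t5-large")

def lm_guided_cot(question: str, context: str) -> dict:
    # Step 1: the lightweight LM generates a rationale for the instance.
    rationale = rationale_lm(
        f"Generate a step-by-step rationale.\n"
        f"Context: {context}\nQuestion: {question}",
        max_new_tokens=128,
    )[0]["generated_text"]

    # Step 2: the frozen large LM is prompted with the rationale and
    # predicts the task output; its weights are never updated.
    answer = answer_lm(
        f"Context: {context}\nQuestion: {question}\n"
        f"Rationale: {rationale}\nAnswer:",
        max_new_tokens=32,
    )[0]["generated_text"]
    return {"rationale": rationale, "answer": answer}
```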