Chain-of-Thought (CoT) prompting has proven effective at enhancing the reasoning capabilities of Large Language Models (LLMs) with at least 100 billion parameters. However, it is ineffective or even detrimental when applied to reasoning tasks in Smaller Language Models (SLMs) with fewer than 10 billion parameters. To address this limitation, we introduce Dialogue-guided Chain-of-Thought (DialCoT), which employs a dialogue format to generate intermediate reasoning steps, guiding the model toward the final answer. Additionally, we optimize the model's reasoning path selection using the Proximal Policy Optimization (PPO) algorithm, further enhancing its reasoning capabilities. Our method offers two advantages over previous approaches. First, it breaks complex reasoning questions down into a series of simpler sub-questions, significantly reducing task difficulty and making the task more suitable for SLMs. Second, it optimizes the model's reasoning path selection through the PPO algorithm. We conduct comprehensive experiments on four arithmetic reasoning datasets, demonstrating that our method achieves significant performance improvements compared to state-of-the-art competitors.
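The dialogue-style decomposition described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Turn`, `format_dialogue`, and `solve_by_dialogue` are hypothetical, the split into separate `ask` and `answer` callables is an assumption, and two scripted stand-ins take the place of an actual SLM. PPO-based reasoning path selection is omitted entirely.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Turn:
    """One dialogue round: a sub-question and its answer (hypothetical structure)."""
    sub_question: str
    sub_answer: str


def format_dialogue(question: str, history: List[Turn]) -> str:
    """Render the problem and all solved sub-questions as a dialogue transcript."""
    lines = [f"Problem: {question}"]
    for turn in history:
        lines.append(f"Q: {turn.sub_question}")
        lines.append(f"A: {turn.sub_answer}")
    return "\n".join(lines)


def solve_by_dialogue(
    question: str,
    ask: Callable[[str], Optional[str]],
    answer: Callable[[str], str],
    max_turns: int = 8,
) -> List[Turn]:
    """Alternate between posing a sub-question and answering it.

    `ask` returns the next sub-question, or None when no further step is needed.
    `answer` returns the answer to the sub-question that ends the prompt.
    Both are assumed interfaces; a real system would back them with a model.
    """
    history: List[Turn] = []
    for _ in range(max_turns):
        prompt = format_dialogue(question, history)
        sub_question = ask(prompt)
        if sub_question is None:
            break
        sub_answer = answer(f"{prompt}\nQ: {sub_question}\nA:")
        history.append(Turn(sub_question, sub_answer))
    return history


# Scripted stand-ins for a model, used only to make the sketch runnable.
SCRIPT = [
    ("How many apples does Tom have after buying 3 more?", "5 + 3 = 8"),
    ("How many apples remain after he gives away 2?", "8 - 2 = 6"),
]


def toy_ask(prompt: str) -> Optional[str]:
    step = prompt.count("\nQ: ")  # number of sub-questions already in the transcript
    return SCRIPT[step][0] if step < len(SCRIPT) else None


def toy_answer(prompt: str) -> str:
    step = prompt.count("\nQ: ") - 1  # index of the sub-question being answered
    return SCRIPT[step][1]


if __name__ == "__main__":
    problem = "Tom has 5 apples, buys 3 more, then gives away 2. How many are left?"
    turns = solve_by_dialogue(problem, toy_ask, toy_answer)
    print(format_dialogue(problem, turns))
```

Each round conditions on the full transcript so far, so every sub-question is answered with the earlier intermediate results in view. In this sketch the answer to the last sub-question stands in for the final answer.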