Chain-of-Thought (CoT) prompting effectively enhances the reasoning capabilities of Large Language Models (LLMs) with at least 100 billion parameters. However, it is ineffective or even detrimental when applied to reasoning tasks in Smaller Language Models (SLMs) with fewer than 10 billion parameters. To address this limitation, we introduce Dialogue-guided Chain-of-Thought (DialCoT), which employs a dialogue format to generate intermediate reasoning steps that guide the model toward the final answer. Additionally, we optimize the model's reasoning path selection using the Proximal Policy Optimization (PPO) algorithm, further enhancing its reasoning capabilities. Our method offers two advantages over previous approaches. First, it breaks complex reasoning questions down into a series of simpler sub-questions, significantly reducing task difficulty and making the task more suitable for SLMs. Second, it optimizes the model's reasoning path selection through PPO. We conduct comprehensive experiments on four arithmetic reasoning datasets, demonstrating that our method achieves significant performance improvements over state-of-the-art competitors.
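As an illustration only, the dialogue-style decomposition described in the abstract might be sketched as follows. This is a minimal Python sketch under assumptions the abstract does not state: the `User`/`Assistant` speaker labels, the prompt layout, and the helper names `build_dialogue` and `solve_by_dialogue` are hypothetical, the sub-questions are assumed to be supplied in advance, and `answer_fn` stands in for any call to an SLM. The PPO-based reasoning path selection is not shown.

```python
from typing import Callable, List, Tuple

# One dialogue turn: (speaker label, text). The labels are illustrative assumptions.
Turn = Tuple[str, str]


def build_dialogue(question: str, history: List[Turn], next_sub_question: str) -> str:
    """Render the problem, all earlier turns, and the next sub-question as one prompt."""
    lines = [f"Problem: {question}"]
    for speaker, text in history:
        lines.append(f"{speaker}: {text}")
    lines.append(f"User: {next_sub_question}")
    lines.append("Assistant:")  # the model is expected to continue from here
    return "\n".join(lines)


def solve_by_dialogue(
    question: str,
    sub_questions: List[str],
    answer_fn: Callable[[str], str],
) -> List[Turn]:
    """Answer each sub-question in order, feeding earlier answers back as context."""
    history: List[Turn] = []
    for sub_question in sub_questions:
        prompt = build_dialogue(question, history, sub_question)
        answer = answer_fn(prompt).strip()
        history.append(("User", sub_question))
        history.append(("Assistant", answer))
    return history


if __name__ == "__main__":
    # Canned replies replace a real model so the sketch runs on its own.
    canned = iter(["5", "8"])
    turns = solve_by_dialogue(
        "Tom has 3 apples, buys 2 more, then buys 3 more. How many apples does he have?",
        [
            "How many apples does Tom have after the first purchase?",
            "How many apples does Tom have after the second purchase?",
        ],
        lambda prompt: next(canned),
    )
    for speaker, text in turns:
        print(f"{speaker}: {text}")
```

Each sub-question is answered with the earlier answers visible in the prompt, so the final turn plays the role of the final answer. In this sketch the intermediate steps are simply the accumulated turns; how sub-questions are actually produced and how candidate paths are scored is left open.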