Chain-of-Thought (CoT) prompting can enhance the reasoning capabilities of large language models (LLMs), and has established itself as a primary approach to solving complex reasoning tasks. Existing CoT synthesis approaches usually focus on simpler reasoning tasks and thus produce low-quality and inconsistent CoT prompts. In response to this challenge, we present an empirical investigation of CoT prompting and introduce CoTGenius, a novel framework designed for the automatic generation of superior CoT prompts. CoTGenius is built on three major evolution strategies, i.e., complicate, diversify, and specify, alongside two filtering mechanisms: evolutionary success judgement and correctness verification. We further employ CoTGenius to create an extensive CoT dataset, and subsequently fine-tune the Llama 2-Chat 7B and 13B models on this dataset. We refer to the resulting models as ChainLM. To address the cumulative error issue in reasoning steps, we propose a step-level debating method, wherein multiple debaters discuss each reasoning step to arrive at the correct answer. Extensive experiments demonstrate that our ChainLM models exhibit enhanced proficiency in addressing a spectrum of complex reasoning problems compared to existing models. In addition, we conduct an in-depth analysis of the impact of data categories within CoTGenius on model performance. We release our dataset and code at https://github.com/RUCAIBox/ChainLM.