Although Large Language Models (LLMs) achieve remarkable performance across various tasks, they often struggle with complex reasoning tasks, such as answering mathematical questions. Recent efforts to address this issue have primarily focused on leveraging mathematical datasets through supervised fine-tuning or self-improvement techniques. However, these methods often depend on high-quality datasets that are difficult to prepare, or they require substantial computational resources for fine-tuning. Inspired by findings that LLMs know how to produce the right answer but struggle to select the correct reasoning path, we propose a purely inference-based search method -- MindStar (M*). This method formulates reasoning tasks as search problems and introduces two search ideas to identify the optimal reasoning paths. We evaluate the M* framework on both the GSM8K and MATH datasets, comparing its performance with existing open and closed-source LLMs. Our results demonstrate that M* significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves performance comparable to GPT-3.5 and Grok-1, but with substantially reduced model size and computational cost.
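The abstract frames reasoning as a search over candidate reasoning paths scored at inference time. As a minimal illustration of that framing (not the paper's actual algorithm), the sketch below runs a generic best-first search: `expand` and `score` are hypothetical stand-ins for an LLM proposing next reasoning steps and a reward model rating partial paths.

```python
import heapq

def best_first_search(question, expand, score, max_steps=8, beam_width=3):
    """Best-first search over partial reasoning paths (illustrative sketch).

    expand(path) -> list of candidate next steps (strings); stands in for
                    sampling continuations from an LLM.
    score(path)  -> float, higher is better; stands in for a reward model.
    """
    # Max-heap via negated scores; each entry is (neg_score, path).
    frontier = [(-score([question]), [question])]
    best = [question]
    for _ in range(max_steps):
        if not frontier:
            break
        neg_score, path = heapq.heappop(frontier)
        if -neg_score > score(best):
            best = path
        # Keep only the top-scoring expansions of this node.
        candidates = [path + [step] for step in expand(path)]
        candidates.sort(key=score, reverse=True)
        for cand in candidates[:beam_width]:
            heapq.heappush(frontier, (-score(cand), cand))
    return best
```

A tree-search variant would revisit and re-expand nodes rather than committing greedily; the key design choice shared by both is that only the scorer, not the base model's weights, guides the search.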