Although Large Language Models (LLMs) achieve remarkable performance across various tasks, they often struggle with complex reasoning, such as answering mathematical questions. Recent efforts to address this issue have primarily focused on leveraging mathematical datasets through supervised fine-tuning or self-improvement techniques. However, these methods often depend on high-quality datasets that are difficult to prepare, or they require substantial computational resources for fine-tuning. Inspired by findings that LLMs know how to produce the right answer but struggle to select the correct reasoning path, we propose MindStar (M*), a purely inference-based search method. M* formulates reasoning tasks as search problems and proposes two search ideas for identifying the optimal reasoning path. We evaluate the M* framework on both the GSM8K and MATH datasets, comparing its performance with existing open-source and closed-source LLMs. Our results demonstrate that M* significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves performance comparable to GPT-3.5 and Grok-1 with a substantially smaller model size and lower computational cost.
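To make the idea of treating reasoning as a search problem concrete, the following is a minimal sketch, not the M* implementation. The abstract does not say which two search strategies are used, so beam search is assumed here purely for illustration. The functions `propose`, `score`, and `is_final` are hypothetical stand-ins for an LLM that generates candidate next steps, a model that rates partial reasoning paths, and a check for a completed answer; a toy arithmetic puzzle replaces the LLM so that the example runs on its own.

```python
from typing import Callable, List, Tuple

Path = Tuple[str, ...]  # a reasoning path is a sequence of steps

START, TARGET = 1, 10   # toy task: turn 1 into 10 using "+3" and "*2"


def value(path: Path) -> int:
    """Apply each step of the path to START and return the result."""
    v = START
    for step in path:
        v = v + 3 if step == "+3" else v * 2
    return v


def propose(path: Path) -> List[str]:
    """Hypothetical stand-in for an LLM proposing candidate next steps."""
    return ["+3", "*2"]


def score(path: Path) -> float:
    """Hypothetical stand-in for a reward model rating a partial path."""
    return -abs(TARGET - value(path))


def is_final(path: Path) -> bool:
    """A path is complete once it reaches the target value."""
    return value(path) == TARGET


def beam_search(
    propose: Callable[[Path], List[str]],
    score: Callable[[Path], float],
    is_final: Callable[[Path], bool],
    beam_width: int = 2,
    max_depth: int = 4,
) -> Path:
    """Keep the `beam_width` best partial paths at every depth."""
    beam: List[Path] = [()]
    for _ in range(max_depth):
        candidates: List[Path] = []
        for path in beam:
            if is_final(path):
                candidates.append(path)  # carry finished paths forward
                continue
            for step in propose(path):
                candidates.append(path + (step,))
        # Rank all expansions and prune to the beam width.
        candidates.sort(key=score, reverse=True)
        beam = candidates[:beam_width]
        if all(is_final(p) for p in beam):
            break
    return max(beam, key=score)


best = beam_search(propose, score, is_final)
print(best, value(best))
```

In this sketch no model weights are updated: all of the work happens at inference time, by generating several candidate steps and using a scoring function to decide which partial paths to keep expanding.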