Large language models (LLMs) have shown promising first-order logic (FOL) reasoning capabilities with applications in various areas. However, their effectiveness in complex mathematical reasoning involving multi-step FOL deductions is still under-researched. While LLMs perform competitively on established mathematical reasoning benchmarks, they struggle with multi-step FOL tasks, as demonstrated by Deepseek-Prover-V2-7B's low accuracy (4.2%) on our proposed theorem proving dataset. This issue arises from the limited exploration of diverse proof strategies and the potential for early reasoning mistakes to undermine entire proofs. To address these issues, we propose DREAM, a self-adaptive solution that enhances the Diversity and REAsonability of LLMs' generation strategies. DREAM incorporates an Axiom-Driven Strategy Diversification mechanism to promote varied strategic outcomes and a Sub-Proposition Error Feedback to help LLMs reflect on and correct their proofs. Our contributions include pioneering advancements in LLMs' mathematical reasoning through FOL theorem proving, introducing a novel inference stage solution that improves performance by 0.6% to 6.4%, and providing a curated dataset of 447 mathematical theorems in Lean 4 format for evaluation.
翻译:大语言模型在涉及一阶逻辑推理的各类应用中展现出潜力,但其处理包含多步一阶逻辑推演的复杂数学推理能力仍待深入探索。尽管大语言模型在现有数学推理基准测试中表现优异,但其在多步一阶逻辑任务中仍存在明显不足——Deepseek-Prover-V2-7B在我们提出的定理证明数据集上准确率仅为4.2%。该问题的根源在于:推理策略多样性探索不足,且早期推理错误可能导致整个证明失效。为此,我们提出DREAM自适应解决方案,旨在增强大语言模型生成策略的多样性与合理性。该方案通过公理驱动策略多样化机制促进策略多样性,并引入子命题错误反馈机制帮助模型反思修正证明过程。本研究的主要贡献包括:首次通过一阶逻辑定理证明推动大语言模型数学推理能力发展,提出可将性能提升0.6%-6.4%的创新推理阶段解决方案,以及构建了包含447个数学定理的Lean 4格式评估数据集。