Large Language Models (LLMs) often struggle with complex mathematical reasoning, where prose-based generation yields solutions that are unverified and arithmetically unsound. Current prompting strategies such as Chain of Thought still operate within this unreliable medium and lack a mechanism for deterministic verification. To address these limitations, we introduce SymCode, a neurosymbolic framework that reframes mathematical problem solving as verifiable code generation with the SymPy library. We evaluate SymCode on challenging benchmarks, including MATH-500 and OlympiadBench, demonstrating accuracy improvements of up to 13.6 percentage points over baselines. Our analysis shows that SymCode is not only more token-efficient but also shifts model failures away from opaque logical fallacies and toward transparent, programmatic errors. By grounding LLM reasoning in a deterministic symbolic engine, SymCode represents a key step toward more accurate and trustworthy AI in formal domains.
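To make the central idea concrete, the following is a minimal sketch of what "verifiable code generation with SymPy" can look like; the specific problem and the verification loop are illustrative assumptions, not the paper's released implementation. The model emits symbolic code whose answer is checked deterministically by substitution, rather than asserted in prose.

```python
# Illustrative sketch (not SymCode's actual code): solve a problem
# symbolically, then verify the result with the symbolic engine itself.
from sympy import symbols, Eq, solve, simplify

x = symbols('x')

# Example problem (assumed for illustration): solve x^2 - 5x + 6 = 0.
equation = Eq(x**2 - 5*x + 6, 0)
solutions = solve(equation, x)

# Deterministic verification: substitute each candidate back into the
# equation and confirm the residual simplifies to zero. A failure here
# surfaces as a transparent, programmatic error instead of an opaque
# logical fallacy buried in natural-language reasoning.
for s in solutions:
    residual = simplify(equation.lhs.subs(x, s) - equation.rhs)
    assert residual == 0, f"candidate {s} failed verification"

print(solutions)  # [2, 3]
```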