Large Language Models (LLMs) achieve strong performance on natural language tasks but remain unreliable in mathematical reasoning, frequently generating fluent yet logically inconsistent solutions. We present \textbf{NeuroProlog}, a neurosymbolic framework that ensures verifiable reasoning by compiling math word problems into executable Prolog programs with formal verification guarantees. We propose a multi-task Cocktail training strategy that jointly optimizes three synergistic objectives in a unified symbolic representation space: (i) mathematical formula-to-rule translation (KB), (ii) natural language-to-program synthesis (SOLVE), and (iii) program-answer alignment. This joint supervision enables positive transfer: symbolic grounding learned during formula translation directly improves compositional reasoning. At inference time, we introduce an execution-guided decoding pipeline with a fine-grained error taxonomy that enables iterative program repair and quantifies the model's self-debugging capacity. Comprehensive evaluation on GSM8K across four model scales (3B--32B parameters) demonstrates consistent improvements: Cocktail training achieves significant accuracy gains of +5.23\% (Qwen-32B, $p < 0.01$), +3.43\% (GPT-OSS-20B, $p < 0.01$), and +5.54\% (Llama-3B, $p < 0.05$) over single-task baselines. Systematic error analysis reveals scale-dependent learning dynamics: at the 32B scale, Cocktail training transforms largely unfixable type errors (12\% repair rate) into correctable domain errors (96\% repair rate), achieving 92.7\% overall correction; at the 8B scale, the same training eliminates syntactic errors but introduces semantic failures, revealing a critical capacity threshold for type-safe symbolic reasoning.
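The execution-guided repair loop summarized above can be sketched as follows. This is a minimal illustration under assumed interfaces: the `execute`, `repair`, and keyword-based `classify_error` functions, the category names, and the round limit are all hypothetical stand-ins, not the paper's actual taxonomy or implementation.

```python
from typing import Callable, Optional, Tuple

# Illustrative (assumed) error categories; the paper's fine-grained
# taxonomy is richer than this.
ERROR_TAXONOMY = ("syntax", "type", "domain", "semantic")


def classify_error(message: str) -> Optional[str]:
    """Map an executor error message to a coarse category via
    keyword matching (an assumption for this sketch)."""
    for category in ERROR_TAXONOMY:
        if category in message:
            return category
    return None


def repair_loop(program: str,
                execute: Callable[[str], Tuple[Optional[int], str]],
                repair: Callable[[str, str], str],
                max_rounds: int = 3) -> Optional[int]:
    """Execute the Prolog program; on failure, ask the model for a
    repaired program conditioned on the error category, retrying up
    to max_rounds times."""
    for _ in range(max_rounds):
        answer, error = execute(program)
        if answer is not None:
            return answer  # verified, executable solution
        category = classify_error(error)
        if category is None:
            break  # unclassified failure: stop rather than loop blindly
        program = repair(program, category)
    return None
```

Here `execute` would wrap a Prolog engine returning either an answer or an error message, and `repair` would re-prompt the model with the failing program and its error category; tracking which categories get fixed is what quantifies self-debugging capacity.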