Large Language Models (LLMs) achieve strong performance on natural language tasks but remain unreliable in mathematical reasoning, frequently generating fluent yet logically inconsistent solutions. We present \textbf{NeuroProlog}, a neurosymbolic framework that ensures verifiable reasoning by compiling math word problems into executable Prolog programs with formal verification guarantees. We propose a multi-task Cocktail training strategy that jointly optimizes three synergistic objectives in a unified symbolic representation space: (i) mathematical formula-to-rule translation (KB), (ii) natural language-to-program synthesis (SOLVE), and (iii) program-answer alignment. This joint supervision enables positive transfer: symbolic grounding learned during formula translation directly improves compositional reasoning. At inference time, we introduce an execution-guided decoding pipeline with a fine-grained error taxonomy that enables iterative program repair and quantifies the model's self-debugging capacity. Comprehensive evaluation on GSM8K across four model scales (3B--32B parameters) demonstrates consistent improvements: Cocktail training achieves significant accuracy gains of +5.23\% (Qwen-32B, $p < 0.01$), +3.43\% (GPT-OSS-20B, $p < 0.01$), and +5.54\% (Llama-3B, $p < 0.05$) over single-task baselines. Systematic error analysis reveals scale-dependent learning dynamics: at the 32B scale, Cocktail training transforms largely unfixable type errors (12\% repair rate) into correctable domain errors (96\% repair rate), achieving 92.7\% overall correction; at the 8B scale, the same training eliminates syntactic errors but introduces semantic failures, revealing a critical capacity threshold for type-safe symbolic reasoning.
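The execution-guided repair loop summarized above can be sketched as follows. This is a minimal illustration under assumed interfaces: the `execute`, `repair`, and keyword-based `classify_error` functions, the category names, and the round limit are all hypothetical stand-ins, not the paper's actual taxonomy or implementation.

```python
from typing import Callable, Optional, Tuple

# Illustrative (assumed) error categories; the paper's fine-grained
# taxonomy is richer than this.
ERROR_TAXONOMY = ("syntax", "type", "domain", "semantic")


def classify_error(message: str) -> Optional[str]:
    """Map an executor error message to a coarse category via
    keyword matching (an assumption for this sketch)."""
    for category in ERROR_TAXONOMY:
        if category in message:
            return category
    return None


def repair_loop(program: str,
                execute: Callable[[str], Tuple[Optional[int], str]],
                repair: Callable[[str, str], str],
                max_rounds: int = 3) -> Optional[int]:
    """Execute the Prolog program; on failure, ask the model for a
    repaired program conditioned on the error category, retrying up
    to max_rounds times."""
    for _ in range(max_rounds):
        answer, error = execute(program)
        if answer is not None:
            return answer  # verified, executable solution
        category = classify_error(error)
        if category is None:
            break  # unclassified failure: stop rather than loop blindly
        program = repair(program, category)
    return None
```

Here `execute` would wrap a Prolog engine returning either an answer or an error message, and `repair` would re-prompt the model with the failing program and its error category; tracking which categories get fixed is what quantifies self-debugging capacity.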