To address the high annotation cost of solving Math Word Problems (MWPs) with full supervision over intermediate equations, recent works have proposed weakly supervised settings that rely solely on the final answer as the supervision signal. Existing leading approaches typically employ various search techniques to infer intermediate equations, but they cannot ensure the semantic consistency of those equations with the natural language descriptions. The rise of Large Language Models (LLMs) such as ChatGPT has opened up new possibilities for solving MWPs directly; however, the computational demands of LLMs make them far from ideal in resource-constrained settings. In light of these challenges, we introduce a two-stage framework that transfers mathematical expertise from large to tiny language models. In the \emph{Distillation Stage}, we propose a series of extraction procedures tailored to the properties of MWPs to distill mathematical knowledge from LLMs and construct the problem-equation pairs required for supervised training. In the \emph{Refinement Stage}, because knowledge distillation alone cannot guarantee that all data are fully utilized, we further exploit the data for which the search fails through a Knowledge Refinement method. Finally, we train a small model on the distilled data generated by the two stages. Because our method fully leverages semantic understanding when searching for problem-equation pairs, it achieves significantly better performance on the Math23K and Weak12K datasets than existing small-model methods, while incurring a much lower computational cost than ChatGPT.