Deriving mathematical results in specialised fields with Large Language Models (LLMs) is an emerging research direction that can help identify models' limitations and potentially support mathematical discovery. In this paper, we leverage a symbolic engine to generate derivations of equations at scale, and investigate the capabilities of LLMs when deriving goal equations from premises. Specifically, we employ in-context learning for GPT and fine-tune a range of T5 models to compare the robustness and generalisation of pre-training strategies against specialised models. Empirical results show that fine-tuned FLAN-T5-large (MathT5) outperforms GPT models on all static and out-of-distribution test sets in conventional scores. However, an in-depth analysis reveals that the fine-tuned models are more sensitive to perturbations involving unseen symbols and (to a lesser extent) changes to equation structure. In addition, we analyse 1.7K equations and over 200 derivations to highlight common reasoning errors such as the inclusion of incorrect, irrelevant, and redundant equations. Finally, we explore the suitability of existing metrics for evaluating mathematical derivations and find evidence that, while they can capture general properties such as sensitivity to perturbations, they fail to highlight fine-grained reasoning errors and essential differences between models. Overall, this work demonstrates that training models on synthetic data may improve their math capabilities beyond those of much larger LLMs, but current metrics do not appropriately assess the quality of generated mathematical text.
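To make the notion of a perturbation involving unseen symbols more concrete, the following is a minimal sketch of one way such a perturbation could be applied to a derivation. It is an illustration only and not the procedure used in this work: the function name `perturb_symbols`, the plain-text equation format, and the toy derivation are all assumptions introduced here.

```python
import re


def perturb_symbols(derivation, mapping):
    """Rename variables in every equation of a derivation.

    Hypothetical helper for illustration: `derivation` is a list of
    plain-text equations and `mapping` sends old symbol names to new ones.
    """
    # Match whole identifiers only, so renaming "x" leaves "exp" untouched.
    identifier = re.compile(r"[A-Za-z_]\w*")

    def swap(match):
        token = match.group(0)
        # All renames are applied in a single pass, so a mapping such as
        # {"x": "q", "y": "x"} does not chain (y -> x -> q).
        return mapping.get(token, token)

    return [identifier.sub(swap, equation) for equation in derivation]


# Toy derivation (an assumption, not taken from the paper's data).
derivation = [
    "y = a*x + b",
    "y - b = a*x",
    "x = (y - b)/a",
]

perturbed = perturb_symbols(derivation, {"x": "q", "y": "x"})
for equation in perturbed:
    print(equation)
```

A model that has learned the underlying algebra, rather than surface patterns tied to particular symbols, would be expected to handle the original and the renamed derivation equally well, since the two differ only in variable names.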