Generating Mathematical Derivations with Large Language Models

The derivation of mathematical results in specialised fields, using Large Language Models (LLMs), is an emerging research direction that can help identify models' limitations, and potentially support mathematical discovery. In this paper, we leverage a symbolic engine to generate derivations of equations at scale, and investigate the capabilities of LLMs when deriving goal equations from premises. Specifically, we employ in-context learning for GPT and fine-tune a range of T5 models to compare the robustness and generalisation of pre-training strategies to specialised models. Empirical results show that fine-tuned FLAN-T5-large (MathT5) outperforms GPT models on all static and out-of-distribution test sets in conventional scores. However, an in-depth analysis reveals that the fine-tuned models are more sensitive to perturbations involving unseen symbols and (to a lesser extent) changes to equation structure. In addition, we analyse 1.7K equations, and over 200 derivations, to highlight common reasoning errors such as the inclusion of incorrect, irrelevant, and redundant equations. Finally, we explore the suitability of existing metrics for evaluating mathematical derivations and find evidence that, while they can capture general properties such as sensitivity to perturbations, they fail to highlight fine-grained reasoning errors and essential differences between models. Overall, this work demonstrates that training models on synthetic data may improve their math capabilities beyond much larger LLMs, but current metrics are not appropriately assessing the quality of generated mathematical text.
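To make the idea of a perturbation involving unseen symbols concrete, here is a minimal sketch under stated assumptions. It is not the paper's pipeline: the abstract mentions a symbolic engine, whereas this toy uses plain-text equations and a regular expression. The helper name `rename_symbols`, the example derivation, and the replacement symbols `phi` and `psi` are all hypothetical. The sketch only shows how every equation in a derivation could be rewritten consistently with symbols a fine-tuned model has not seen.

```python
import re

def rename_symbols(derivation, mapping):
    """Consistently rename symbols across every equation in a derivation.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if not mapping:
        return list(derivation)
    # Match whole identifiers only, so renaming "x" leaves "exp" untouched.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    # A single pass per equation keeps swaps such as x <-> y well defined.
    return [pattern.sub(lambda m: mapping[m.group(1)], eq) for eq in derivation]

# Toy derivation (assumed example): a premise followed by two steps.
derivation = [
    "y = x**2 + exp(x)",
    "diff(y, x) = 2*x + exp(x)",
    "diff(y, x) - exp(x) = 2*x",
]

# Stand-ins for symbols that would be absent from the training data.
perturbed = rename_symbols(derivation, {"x": "phi", "y": "psi"})
for equation in perturbed:
    print(equation)
```

The derivation is mathematically unchanged by the renaming, so a model that has learned the underlying operations rather than surface patterns should, in principle, handle the perturbed version as well as the original.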

