Large language models (LLMs) can solve arithmetic word problems with high accuracy, but little is known about how well they generalize to more complex problems. This is difficult to study, as (i) much of the available evaluation data has already been seen by the most capable models during training, and (ii) existing benchmarks do not capture how problem proofs may be arbitrarily complex in various ways. In this paper, we present MathGAP, a data-generation framework for evaluating LLMs on problems with arbitrarily complex arithmetic proofs. MathGAP generates problem statements and chain-of-thought reasoning traces according to specifications of their arithmetic proof structure, enabling systematic studies of easy-to-hard generalization with respect to proof-tree complexity. Using MathGAP, we find that LLM performance decreases significantly as proofs grow deeper and wider. This effect is more pronounced for complex, nonlinear proof structures, which are challenging even for the most capable models. The models are also sensitive to simple changes in sentence ordering. However, they remain capable of solving some complex problems, suggesting that reasoning generalization is noisy.
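The idea of generating word problems from a proof-structure specification can be illustrated with a minimal sketch. This is not the MathGAP implementation; the entity names, the apples domain, and the single sum-only inference rule are assumptions chosen for illustration — the point is only that depth (and, by allowing more children per node, width) of the proof tree is a controllable parameter:

```python
import random

def gen_proof_tree(depth, rng):
    """Recursively build a toy proof tree: leaves hold known quantities,
    each internal node is a 'combine' inference summing its two children."""
    if depth == 0:
        return {"value": rng.randint(1, 9), "children": []}
    children = [gen_proof_tree(depth - 1, rng) for _ in range(2)]
    return {"value": sum(c["value"] for c in children), "children": children}

def leaves(node):
    """Collect leaf nodes left to right; these become the premise sentences."""
    if not node["children"]:
        return [node]
    return [l for c in node["children"] for l in leaves(c)]

def render_problem(tree, names):
    """Verbalize each leaf as a premise and ask for the root quantity."""
    premises = [f"{name} has {leaf['value']} apples."
                for name, leaf in zip(names, leaves(tree))]
    question = "How many apples do they have in total?"
    return " ".join(premises + [question]), tree["value"]

rng = random.Random(0)
tree = gen_proof_tree(2, rng)  # depth-2 proof tree -> 4 leaf premises
problem, answer = render_problem(tree, ["Alice", "Bob", "Carol", "Dan"])
```

Because the tree is built before it is verbalized, the same structure also yields a ground-truth chain-of-thought trace (one line per internal node), and premise order can be permuted independently of the proof — the two manipulations the abstract reports models being sensitive to.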