This paper proposes a methodology for generating and perturbing detailed derivations of equations at scale, aided by a symbolic engine, to evaluate the generalisability of Transformers to out-of-distribution mathematical reasoning problems. Instantiating the framework in the context of sequence classification tasks, we compare the capabilities of GPT-4, GPT-3.5, and a canon of fine-tuned BERT models, exploring the relationship between specific operators and generalisation failure via the perturbation of reasoning aspects such as symmetry and variable surface forms. Surprisingly, our empirical evaluation reveals that the average in-distribution performance of fine-tuned models surpasses GPT-3.5, and rivals GPT-4. However, perturbations to input reasoning can reduce their performance by up to 80 F1 points. Overall, the results suggest that the in-distribution performance of smaller open-source models may rival GPT by incorporating appropriately structured derivation dependencies during training, and highlight a shared weakness between BERT and GPT involving a relative inability to decode indirect references to mathematical entities. We release the full codebase, constructed datasets, and fine-tuned models to encourage future progress in the field.