Language models have recently produced a number of impressive results on hard mathematical reasoning problems. At the same time, the robustness of these models has been called into question: recent work has shown that models can rely on shallow patterns in the problem description when generating a solution. Building on the idea of behavioral testing, we propose a novel framework that pins down the causal effect of various factors in the input (e.g., the surface form of the problem text, the operands, and the math operators) on the output solution. By grounding the behavioral analysis in a causal graph describing an intuitive reasoning process, we study the behavior of language models in terms of robustness and sensitivity to direct interventions in the input space. We apply our framework to a test bed of math word problems. Our analysis shows that robustness does not appear to improve continuously as a function of model size, but the GPT-3 Davinci models (175B) achieve a dramatic improvement in both robustness and sensitivity compared to all other GPT variants.
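The distinction between robustness and sensitivity can be made concrete with a toy sketch. It is only an illustration under stated assumptions, not the estimator used in this work: `toy_model` is a hypothetical stand-in for a language model, the example problems are invented, and `prediction_change_rate` is an illustrative name for a simple "did the answer change?" measure. A robust model should keep its answer when only the surface form changes, and a sensitive model should change its answer when the operands (and hence the correct result) change.

```python
import re
from typing import Callable, List, Tuple

def toy_model(problem: str) -> int:
    """Hypothetical stand-in for a language model: adds every number it finds."""
    return sum(int(tok) for tok in re.findall(r"\d+", problem))

def prediction_change_rate(
    model: Callable[[str], int], pairs: List[Tuple[str, str]]
) -> float:
    """Fraction of (original, intervened) pairs on which the prediction changes."""
    changed = sum(model(original) != model(intervened) for original, intervened in pairs)
    return changed / len(pairs)

# Surface-form interventions: the wording changes, the operands and the
# correct answer stay the same. A low change rate indicates robustness.
surface_pairs = [
    ("Ann has 3 apples and buys 4 more. How many apples does she have?",
     "Ann owns 3 apples. She purchases 4 more. How many apples does Ann have now?"),
    ("A shelf holds 10 books and 5 are added. How many books are there?",
     "There are 10 books on a shelf. After adding 5, how many books are there?"),
]

# Operand interventions: the wording is fixed, the operands and the correct
# answer change. A high change rate indicates sensitivity.
operand_pairs = [
    ("Ann has 3 apples and buys 4 more. How many apples does she have?",
     "Ann has 3 apples and buys 9 more. How many apples does she have?"),
    ("A shelf holds 10 books and 5 are added. How many books are there?",
     "A shelf holds 2 books and 5 are added. How many books are there?"),
]

surface_change = prediction_change_rate(toy_model, surface_pairs)
operand_change = prediction_change_rate(toy_model, operand_pairs)
print(f"change rate under surface-form interventions: {surface_change}")
print(f"change rate under operand interventions: {operand_change}")
```

The toy model ignores wording entirely, so it is trivially robust here. Replacing `toy_model` with a call to an actual language model would be the natural next step, and a distributional measure over the model's output probabilities may be more informative than this binary comparison.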