Testing LLM Arithmetic Reasoning Generalization with Automatic Numeric-Remapping Attacks

Large language models achieve strong performance on arithmetic reasoning benchmarks, and one common response to arithmetic brittleness is to delegate computation to code. Yet models are still often used in settings where they must reason directly from natural language, and trustworthy models should solve small-number arithmetic word problems without external tools. Prior work shows that LLMs are sensitive to numerical variation: a model may solve an original problem but fail on structurally similar variants requiring the same reasoning procedure with different numbers. We ask whether this fragility persists under a stricter setting involving small, schema-preserving numeric changes that retain the original reasoning program and avoid large-number stress tests. We introduce an automatic algorithm for generating numeric-remapping attacks on arithmetic word problems. Unlike template-based perturbation methods requiring manual schemas or constraints, our approach derives problem-specific symbolic representations, generates constrained numeric remappings, recomputes gold answers, and realizes transformed questions through deterministic edits guided by LLM-generated edit plans. Stage-wise validation and a high-confidence audit retain reliable attacks, making the pipeline scalable with limited human intervention. We evaluate DeepSeek-R1 (70B), Gemma4 (31B), and GPT-OSS (120B) on GSM8K, MAWPS, and MultiArith. On GSM8K, completed runs show conditional accuracy drops of 12.16 to 25.82 percentage points. MAWPS and MultiArith are far more stable, with most attacked accuracies near or above 98%. These results show that numeric-remapping robustness depends strongly on dataset structure: GSM8K remains sensitive even when reasoning programs are preserved and answers are recomputed, while shorter, more regular datasets are more robust.
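The remap, recompute, and realize steps described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy problem, the hand-written `program`, the `is_valid` constraint, the `max_delta` bound, and the positional edit plan `ORDER` are all assumptions made for illustration. In the paper, the symbolic representation and the edit plan are derived automatically.

```python
import random
import re

# Hypothetical toy problem; in the paper the symbolic form is derived automatically.
QUESTION = ("Tom has 5 apples and buys 3 bags with 4 apples each. "
            "How many apples does Tom have now?")
VALUES = {"a": 5, "b": 3, "c": 4}
# Assumed edit plan: the i-th number in the text corresponds to ORDER[i].
ORDER = ["a", "b", "c"]


def program(v):
    """Reasoning program shared by the original and every remapped variant."""
    return v["a"] + v["b"] * v["c"]


def is_valid(v):
    """Illustrative constraint: every quantity stays a positive integer."""
    return all(isinstance(x, int) and x >= 1 for x in v.values())


def remap(values, rng, max_delta=3, max_tries=100):
    """Sample a small numeric change that satisfies the constraint."""
    for _ in range(max_tries):
        cand = {k: x + rng.randint(-max_delta, max_delta)
                for k, x in values.items()}
        if cand != values and is_valid(cand):
            return cand
    raise ValueError("no valid remapping found")


def realize(question, order, new_values):
    """Deterministically rewrite the numbers in the text, left to right."""
    found = re.findall(r"\d+", question)
    if len(found) != len(order):
        raise ValueError("edit plan does not match the numbers in the text")
    names = iter(order)
    return re.sub(r"\d+", lambda m: str(new_values[next(names)]), question)


rng = random.Random(0)
new_values = remap(VALUES, rng)
attacked_question = realize(QUESTION, ORDER, new_values)
attacked_gold = program(new_values)  # gold answer recomputed, not copied
print(attacked_question, "->", attacked_gold)
```

Because the same `program` is evaluated on the new values, the variant keeps the original reasoning steps and changes only the numbers and the gold answer. The bounded `max_delta` keeps the values small, in line with the abstract's goal of avoiding large-number stress tests.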

