Large language models achieve strong performance on arithmetic reasoning benchmarks, and one common response to arithmetic brittleness is to delegate computation to code. Yet models are still often used in settings where they must reason directly from natural language, and trustworthy models should solve small-number arithmetic word problems without external tools. Prior work shows that LLMs are sensitive to numerical variation: a model may solve an original problem but fail on structurally similar variants requiring the same reasoning procedure with different numbers. We ask whether this fragility persists under a stricter setting involving small, schema-preserving numeric changes that retain the original reasoning program and avoid large-number stress tests. We introduce an automatic algorithm for generating numeric-remapping attacks on arithmetic word problems. Unlike template-based perturbation methods requiring manual schemas or constraints, our approach derives problem-specific symbolic representations, generates constrained numeric remappings, recomputes gold answers, and realizes transformed questions through deterministic edits guided by LLM-generated edit plans. Stage-wise validation and a high-confidence audit retain reliable attacks, making the pipeline scalable with limited human intervention. We evaluate DeepSeek-R1 (70B), Gemma4 (31B), and GPT-OSS (120B) on GSM8K, MAWPS, and MultiArith. On GSM8K, completed runs show conditional accuracy drops of 12.16 to 25.82 percentage points. MAWPS and MultiArith are far more stable, with most attacked accuracies near or above 98%. These results show that numeric-remapping robustness depends strongly on dataset structure: GSM8K remains sensitive even when reasoning programs are preserved and answers are recomputed, while shorter, more regular datasets are more robust.
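To make the pipeline stages concrete, the following is a minimal sketch of one numeric-remapping attack on a single toy problem. It is illustrative only and is not the implementation evaluated here. All names (`SymbolicProblem`, `sample_remapping`, `realize`, `attack`), the form of the edit plan (an ordered list of number tokens paired with variable names), the `max_value=20` bound on "small" numbers, and the example problem are assumptions introduced for this sketch. In the described approach the symbolic representation and the edit plan are produced by an LLM; here they are written by hand.

```python
import random
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Values = Dict[str, int]


@dataclass
class SymbolicProblem:
    """Hypothetical problem-specific symbolic representation."""
    text: str                                    # original question
    values: Values                               # variable name -> original number
    program: Callable[[Values], int]             # reasoning program (kept fixed)
    constraints: List[Callable[[Values], bool]]  # validity checks on a remapping
    edit_plan: List[Tuple[str, str]]             # (number token, variable), in text order


def sample_remapping(problem: SymbolicProblem, rng: random.Random,
                     max_value: int = 20, max_tries: int = 1000) -> Optional[Values]:
    """Rejection-sample small numbers that satisfy every constraint."""
    for _ in range(max_tries):
        candidate = {name: rng.randint(1, max_value) for name in problem.values}
        if candidate != problem.values and all(c(candidate) for c in problem.constraints):
            return candidate
    return None  # no valid remapping found within the budget


def realize(text: str, edit_plan: List[Tuple[str, str]], values: Values) -> str:
    """Apply the edit plan left to right as deterministic token replacements."""
    pieces, cursor = [], 0
    for token, name in edit_plan:
        match = re.compile(rf"\b{re.escape(token)}\b").search(text, cursor)
        if match is None:
            raise ValueError(f"token {token!r} not found after position {cursor}")
        pieces.append(text[cursor:match.start()])
        pieces.append(str(values[name]))
        cursor = match.end()
    pieces.append(text[cursor:])
    return "".join(pieces)


def attack(problem: SymbolicProblem, rng: random.Random) -> Optional[Tuple[str, int]]:
    """Return (attacked question, recomputed gold answer), or None on failure."""
    # Stage-wise check: the edit plan must reproduce the original text unchanged.
    if realize(problem.text, problem.edit_plan, problem.values) != problem.text:
        return None
    new_values = sample_remapping(problem, rng)
    if new_values is None:
        return None
    return realize(problem.text, problem.edit_plan, new_values), problem.program(new_values)


# Hand-written toy example (not taken from GSM8K, MAWPS, or MultiArith).
EXAMPLE = SymbolicProblem(
    text="Tom has 12 apples and gives 5 to Ana. How many apples does Tom have left?",
    values={"start": 12, "given": 5},
    program=lambda v: v["start"] - v["given"],
    constraints=[lambda v: v["given"] < v["start"]],  # keep the answer positive
    edit_plan=[("12", "start"), ("5", "given")],
)
```

Calling `attack(EXAMPLE, random.Random(0))` yields a reworded question and its recomputed answer. The reasoning program is never modified; only its inputs change, which is what makes the remapping schema-preserving in this sketch.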