Modern AI systems have been successfully deployed to win medals at international math competitions, assist with research workflows, and prove novel technical lemmas. Yet despite this progress on advanced mathematics, they remain stubbornly bad at basic arithmetic, consistently failing on the simple task of adding two numbers. We present a systematic investigation of this phenomenon. We demonstrate empirically that all frontier models suffer significantly degraded accuracy on integer addition as the number of digits increases. Furthermore, we show that most errors made by these models are highly interpretable and can be attributed to either operand misalignment or a failure to correctly carry; these two error classes explain 87.9%, 62.9%, and 92.4% of the errors made by Claude Opus 4.1, GPT-5, and Gemini 2.5 Pro, respectively. Finally, we show that misalignment errors are frequently related to tokenization, and that carrying errors appear largely as independent random failures.
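To make the "independent random carry failure" claim concrete, here is a minimal toy simulation (not the paper's method): schoolbook addition in which each carry propagation step is independently dropped with some probability `p_fail` (a hypothetical parameter). Under this model, per-problem accuracy naturally degrades with digit count, since longer operands expose more carry steps to failure.

```python
import random

def add_with_carry_noise(a: int, b: int, p_fail: float, rng: random.Random) -> int:
    """Schoolbook addition where each carry is independently
    dropped with probability p_fail. A toy model for illustration."""
    da = [int(c) for c in str(a)][::-1]  # least-significant digit first
    db = [int(c) for c in str(b)][::-1]
    n = max(len(da), len(db))
    da += [0] * (n - len(da))
    db += [0] * (n - len(db))
    out, carry = [], 0
    for i in range(n):
        s = da[i] + db[i] + carry
        out.append(s % 10)
        carry = s // 10
        if carry and rng.random() < p_fail:
            carry = 0  # independent random carry failure
    if carry:
        out.append(carry)
    return int("".join(str(d) for d in reversed(out)))

def accuracy(n_digits: int, p_fail: float = 0.02,
             trials: int = 2000, seed: int = 0) -> float:
    """Fraction of random n-digit additions the noisy adder gets right."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        a = rng.randrange(10 ** (n_digits - 1), 10 ** n_digits)
        b = rng.randrange(10 ** (n_digits - 1), 10 ** n_digits)
        if add_with_carry_noise(a, b, p_fail, rng) == a + b:
            correct += 1
    return correct / trials
```

Running `accuracy(2)` versus `accuracy(20)` shows the qualitative pattern the abstract describes: with a fixed per-carry failure rate, accuracy falls as operand length grows.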