Overcoming the Float Wall: Verifying Mathematical Laws at $10^{50}$ Scale with BigInt Transformers

From arXiv, v2: Major revision. Added a 'Future Work' section on Neural Symbolic Regression. Revised the title and abstract to highlight 'Float Wall' precision limits and to explicitly detail findings on physics generalization (Double Pendulum).

A central question in artificial intelligence is whether models learn universal laws or merely memorize statistical heuristics. This distinction is particularly critical in scientific computing, where approximation errors are unacceptable. I investigate this by training models on the Pythagorean Theorem ($a^2+b^2=c^2$) using a massive dataset of $10^{10}$ samples. I identify a fundamental barrier I term the "Float Wall" ($N > 10^{16}$): the point where IEEE 754 double-precision arithmetic can no longer distinguish adjacent integers (a double represents every integer exactly only up to $2^{53} \approx 9 \times 10^{15}$), causing standard loss functions to collapse. To overcome this, I adopt a BigInt-native approach, treating numbers as symbolic sequences of digits rather than as continuous approximate values. My results reveal a stark dichotomy. Statistical models (Gradient Boosted Decision Trees), despite seeing $10^{10}$ examples, failed to generalize beyond the training range, memorizing local manifolds rather than the underlying law. In contrast, my Arithmetic Transformer, trained on fewer than $10^3$ samples, successfully extrapolated the Pythagorean theorem to cosmic scales ($N \approx 10^{50}$). Limits remain, however: in continuous physics tasks (Double Pendulum), the model correctly identified causal structures but struggled with high-entropy chaotic states and fine-grained perturbations. This suggests that while symbolic tokenization solves the precision problem for discrete algebra, bridging the gap to continuous dynamics remains an open challenge.
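The "Float Wall" described in the abstract can be illustrated with a minimal Python sketch. It shows that a double cannot separate $N$ from $N+1$ at $10^{16}$, while arbitrary-precision integers verify $a^2+b^2=c^2$ exactly at $10^{50}$. The function names and the one-token-per-decimal-digit scheme are assumptions made for illustration; the abstract says only that numbers are treated as symbolic sequences of digits, and this is not the author's implementation.

```python
from typing import List, Tuple


def float_collapses(n: int) -> bool:
    """Return True if n and n + 1 convert to the same IEEE 754 double."""
    return float(n) == float(n + 1)


def scaled_triple(k: int) -> Tuple[int, int, int]:
    """Scale the (3, 4, 5) Pythagorean triple by k using exact integers."""
    return 3 * k, 4 * k, 5 * k


def is_pythagorean(a: int, b: int, c: int) -> bool:
    """Exact check of a^2 + b^2 == c^2 with arbitrary-precision ints."""
    return a * a + b * b == c * c


def to_digit_tokens(n: int) -> List[str]:
    """Hypothetical tokenization: one token per decimal digit."""
    return list(str(n))


def from_digit_tokens(tokens: List[str]) -> int:
    """Inverse of to_digit_tokens."""
    return int("".join(tokens))


if __name__ == "__main__":
    # Below the wall, adjacent integers stay distinct as doubles.
    print(float_collapses(10**15))
    # At 10^16, n and n + 1 round to the same double.
    print(float_collapses(10**16))

    a, b, c = scaled_triple(10**50)
    # Exact integers accept the true hypotenuse and reject an off-by-one.
    print(is_pythagorean(a, b, c), is_pythagorean(a, b, c + 1))
    # As doubles, the true and off-by-one hypotenuse are the same number,
    # so a float-based loss cannot tell them apart.
    print(float(c) == float(c + 1))

    # The digit-token representation round-trips without loss.
    print(from_digit_tokens(to_digit_tokens(c)) == c)
```

The point of the digit-sequence representation in this sketch is that its precision does not depend on magnitude: a 51-digit integer is simply a longer sequence, whereas a double keeps only 53 bits of significand regardless of scale.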
