We study whether a large language model can learn the deterministic sequence of trees generated by the iterated prime factorization of the natural numbers. Each integer is mapped to a rooted planar tree, and the resulting sequence $\mathbb{N}\mathcal{T}$ defines an arithmetic text with measurable statistical structure. A transformer network (the GPT-2 architecture) is trained from scratch on the first $10^{11}$ elements, and its predictive ability is subsequently tested on next-word and masked-word prediction tasks. Our results show that the model partially learns the internal grammar of $\mathbb{N}\mathcal{T}$, capturing non-trivial regularities and correlations. This suggests that learnability may extend beyond empirical data to the very structure of arithmetic.
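As an illustration of the kind of map the abstract refers to, the sketch below builds a rooted tree for each integer by iterated prime factorization. It is a minimal sketch only: it assumes the Matula-Goebel-style convention (1 maps to a single node; the children of $n>1$ are the trees of the prime indices of its factors, with multiplicity) and uses sympy for factorization, neither of which is specified in the abstract itself.

```python
# Minimal sketch of an integer -> rooted tree map via iterated prime
# factorization. ASSUMPTION: Matula-Goebel convention; the paper's exact
# encoding of N\mathcal{T} may differ.
from sympy import factorint, primepi

def tree(n: int):
    """Return the rooted planar tree of n as a nested tuple of children."""
    if n == 1:
        return ()                       # 1 is a single (leaf) node
    children = []
    for p, k in sorted(factorint(n).items()):
        # primepi(p) is the index of the prime p (primepi(2) = 1, ...);
        # each factor contributes one child subtree per multiplicity.
        children.extend([tree(int(primepi(p)))] * k)
    return tuple(children)

# First few elements of the tree sequence, printed as bracket strings.
for n in range(1, 9):
    print(n, tree(n))
```

Printed as nested brackets, such trees form a deterministic "text" over a two-symbol alphabet, which is the sense in which the sequence can be fed to a language model.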