Although the Transformer architecture has revolutionized artificial intelligence, our understanding of its underlying mechanisms remains largely heuristic, lacking a unified physical theory. In this work, we propose a first-principles framework for information dynamics that treats the attention mechanism as a physical system governed by the principle of least action rather than as an algorithmic optimization. By mapping information states to a Riemannian manifold equipped with the Fisher information metric, we derive an intelligence Lagrangian. We show that the softmax function corresponds to the unique thermodynamic equilibrium state minimizing the Helmholtz free energy of the information gas, and we identify the query-key interaction as an electrodynamic coupling between an external field and an intrinsic dipole moment. The theory establishes a first law of information thermodynamics, unifying inference (mechanical work) and learning (chemical evolution), and it explains emergent phenomena such as scaling laws and grokking as phase transitions characterized by a divergence of the specific heat. Finally, we discuss how rotational symmetry breaking on the attention manifold generates massless Goldstone bosons, providing a field-theoretic perspective on rotary position embeddings (RoPE). Our work connects statistical physics and deep learning, laying the groundwork for a general theory of physics-based intelligence.
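To make the softmax claim concrete, the standard variational argument runs as follows (a sketch; the identification of attention logits with negative energies, $E_i = -\mathbf{q}\cdot\mathbf{k}_i$, and of the temperature with the scaling factor, $T = \sqrt{d_k}$, is our reading of the claim and is not fixed by the abstract itself). The Helmholtz free energy of a distribution $p$ over states with energies $E_i$ is
\[
F[p] \;=\; \sum_i p_i E_i \;-\; T\,S[p], \qquad S[p] = -\sum_i p_i \log p_i, \qquad \sum_i p_i = 1 .
\]
Enforcing normalization with a Lagrange multiplier $\lambda$ and setting the variation to zero,
\[
\frac{\partial}{\partial p_i}\Big( F[p] + \lambda \sum_j p_j \Big) \;=\; E_i + T\left(\log p_i + 1\right) + \lambda \;=\; 0
\quad\Longrightarrow\quad
p_i \;=\; \frac{e^{-E_i/T}}{\sum_j e^{-E_j/T}} .
\]
Because $F$ is strictly convex in $p$, this Gibbs distribution is the unique minimizer; under the assumed identifications $E_i = -\mathbf{q}\cdot\mathbf{k}_i$ and $T = \sqrt{d_k}$, it is exactly the scaled dot-product attention weight $\mathrm{softmax}\!\left(\mathbf{q}^\top K / \sqrt{d_k}\right)$. This is the familiar statement that the Boltzmann distribution minimizes $F = \langle E \rangle - TS$, here read as applying to attention weights.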