We propose an effective field-theoretic framework for analyzing Transformer attention through a thermodynamic lens. By constructing a Lagrangian on the information manifold equipped with the Fisher metric, we show that, within the Shannon--Boltzmann entropy framework, the Softmax function arises as a stationary solution minimizing a Helmholtz free energy functional. This establishes a formal correspondence between scaled dot-product attention and canonical ensemble statistics. Extending this mapping to macroscopic observables, we define an effective specific heat associated with fluctuations of the attention energy landscape. In controlled experiments on the modular addition task ($p = 19$--$113$), we observe a robust peak in this fluctuation measure that consistently precedes the onset of generalization. While no asymptotic power-law divergence is detected in this finite-depth regime, the reproducible enhancement of energy variance suggests a critical-like crossover accompanying representational reorganization. Our framework provides a unified statistical-mechanical perspective on attention scaling, training dynamics, and positional encoding, interpreting the phenomena as emergent properties of an effective thermodynamic system rather than isolated heuristics. Although the present results indicate finite-size crossover behavior rather than a strict phase transition, they motivate further investigation into scaling limits of deep architectures through fluctuation-based observables.
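The central correspondence stated above can be checked numerically. The sketch below (illustrative names, not from the paper) treats attention scores as negative energies at temperature $T=\sqrt{d_k}$, so that the softmax weights are exactly Boltzmann weights $p_i \propto e^{-E_i/T}$; it then verifies that this distribution minimizes the Helmholtz free energy $F[q] = \langle E\rangle_q - T\,S(q)$ against random competitors, and computes the effective specific heat $C = \mathrm{Var}_p(E)/T^2$ from the energy fluctuations:

```python
import numpy as np

def softmax_boltzmann(E, T):
    """Boltzmann / softmax weights p_i ∝ exp(-E_i / T)."""
    z = np.exp(-(E - E.min()) / T)        # shift for numerical stability
    return z / z.sum()

def free_energy(q, E, T):
    """Helmholtz free energy F[q] = <E>_q - T * S(q), with S = -Σ q log q."""
    q = np.clip(q, 1e-12, None)
    return np.sum(q * E) + T * np.sum(q * np.log(q))

def specific_heat(E, T):
    """Effective specific heat C = Var_p(E) / T^2 (canonical ensemble)."""
    p = softmax_boltzmann(E, T)
    mean_E = np.sum(p * E)
    var_E = np.sum(p * (E - mean_E) ** 2)
    return var_E / T**2

rng = np.random.default_rng(0)
E = rng.normal(size=8)                    # toy attention "energies"
T = np.sqrt(64.0)                         # e.g. sqrt(d_k) with d_k = 64
p = softmax_boltzmann(E, T)
F_star = free_energy(p, E, T)

# The Boltzmann/softmax distribution should beat any random competitor.
for _ in range(100):
    q = rng.dirichlet(np.ones(8))
    assert free_energy(q, E, T) >= F_star - 1e-9
```

This is a minimal sketch of the canonical-ensemble mapping only; the paper's Fisher-metric Lagrangian construction and the modular-addition experiments are not reproduced here.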