We propose a variational framework that interprets transformer layers as iterations of an optimization algorithm acting on token embeddings. In this view, self-attention implements a gradient step of an interaction energy, while MLP layers correspond to gradient updates of a potential energy. Standard GPT-style transformers emerge as vanilla gradient descent on the resulting composite objective, implemented via Lie--Trotter splitting between these two energy functionals. This perspective enables principled architectural design using classical optimization ideas. As a proof of concept, we introduce a Nesterov-style accelerated transformer that preserves the same attention and MLP oracles. The resulting architecture consistently outperforms a nanoGPT baseline on TinyStories and OpenWebText, demonstrating that optimization-theoretic insights can translate into practical gains.
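The update scheme described above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the `attn` and `mlp` functions below are simplified stand-ins for the attention and MLP oracles, and all names and constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8  # toy sequence length and embedding dimension


def attn(x):
    # Stand-in for the attention oracle: a gradient-like step of an
    # interaction energy that pulls each token toward a softmax-weighted
    # mean of the others.
    scores = x @ x.T
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ x - x


def mlp(x):
    # Stand-in for the MLP oracle: a gradient-like step of a
    # pointwise potential (illustrative choice, not the paper's).
    return -0.1 * np.maximum(x, 0.0)


def vanilla_layer(x):
    # GPT-style block as Lie-Trotter splitting: alternate a gradient
    # step on the interaction energy with one on the potential energy.
    x = x + attn(x)
    x = x + mlp(x)
    return x


def nesterov_layer(x, v, beta=0.9):
    # Nesterov-style accelerated block: evaluate the SAME two oracles
    # at a momentum look-ahead point instead of the current iterate.
    y = x + beta * v        # look-ahead
    y = y + attn(y)         # interaction-energy step
    x_new = y + mlp(y)      # potential-energy step
    v_new = x_new - x       # updated momentum
    return x_new, v_new


x = rng.standard_normal((n, d))
v = np.zeros_like(x)
for _ in range(3):  # three "layers"
    x, v = nesterov_layer(x, v)
```

Because both variants call the same `attn` and `mlp` oracles, the accelerated block matches the baseline's parameter count and per-layer cost; only the iteration scheme changes.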