Softmax attention is a central component of transformer architectures, yet its nonlinear structure poses significant challenges for theoretical analysis. We develop a unified, measure-based framework for studying single-layer softmax attention under both finite and infinite prompts. For i.i.d. Gaussian inputs, we exploit the fact that, in the infinite-prompt limit, the softmax operator converges to a linear operator acting on the underlying input-token measure. Building on this insight, we establish non-asymptotic concentration bounds for the output and gradient of softmax attention, quantifying how rapidly the finite-prompt model approaches its infinite-prompt counterpart. We further prove that this concentration remains stable along the entire training trajectory in general in-context learning settings with sub-Gaussian tokens. In the case of in-context linear regression, we use the tractable infinite-prompt dynamics to analyze training at finite prompt length. Our results allow optimization analyses developed for linear attention to transfer directly to softmax attention when prompts are sufficiently long, showing that large-prompt softmax attention inherits the analytical structure of its linear counterpart. This, in turn, provides a principled and broadly applicable toolkit for studying the training dynamics and statistical behavior of softmax attention layers in the large-prompt regime.
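The infinite-prompt linearization can be illustrated numerically in one simple special case. The sketch below is not the paper's construction: it assumes zero-mean Gaussian tokens x ~ N(0, Sigma), a single query q, values equal to the tokens themselves, and no learned weight matrices. Under those assumptions the standard Gaussian exponential-tilting identity gives E[x exp(q·x)] / E[exp(q·x)] = Sigma q, which is linear in q. The function names (`softmax_attention`, `infinite_prompt_limit`, `gap`) and all parameter values are hypothetical, chosen only for illustration.

```python
import numpy as np


def softmax_attention(q, X, V):
    """Single-query softmax attention: sum_i softmax(X q)_i * V[i]."""
    scores = X @ q
    weights = np.exp(scores - scores.max())  # subtract the max for numerical stability
    weights /= weights.sum()
    return weights @ V


def infinite_prompt_limit(q, Sigma):
    """Assumed toy limit for x ~ N(0, Sigma) with values v = x.

    Gaussian exponential tilting: E[x e^{q.x}] / E[e^{q.x}] = Sigma @ q.
    """
    return Sigma @ q


def gap(n, q, Sigma, rng):
    """Distance between the finite-prompt output (n tokens) and the assumed limit."""
    L = np.linalg.cholesky(Sigma)
    X = rng.standard_normal((n, len(q))) @ L.T  # rows are i.i.d. N(0, Sigma) tokens
    return np.linalg.norm(softmax_attention(q, X, X) - infinite_prompt_limit(q, Sigma))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = np.array([0.3, -0.2, 0.4])       # hypothetical query
    Sigma = np.diag([1.0, 0.5, 2.0])     # hypothetical token covariance
    for n in (100, 1_000, 10_000, 100_000):
        print(n, gap(n, q, Sigma, rng))
```

Running the script prints the gap for increasing prompt lengths; in this toy setting it should shrink as n grows, which is the qualitative behavior the concentration bounds quantify.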