Understanding the theoretical foundations of attention mechanisms remains challenging due to their complex, non-linear dynamics. This work reveals a fundamental trade-off in the learning dynamics of linearized attention. For a linearized attention mechanism that corresponds exactly to a data-dependent Gram-induced kernel, empirical and theoretical analyses within the Neural Tangent Kernel (NTK) framework show that linearized attention does not converge to its infinite-width NTK limit, even at large widths. A spectral amplification result establishes this formally: the attention transformation cubes the condition number $\kappa$ of the Gram matrix, requiring width $m = \Omega(\kappa^6)$ for convergence, a threshold that exceeds any practical width for natural image datasets. This non-convergence is characterized through influence malleability, the capacity of a model to dynamically alter its reliance on individual training examples. Attention exhibits 6--9$\times$ higher malleability than ReLU networks, with dual implications: its data-dependent kernel can reduce approximation error by aligning with task structure, but the same sensitivity increases susceptibility to adversarial manipulation of training data. These findings suggest that attention's power and vulnerability share a common origin in its departure from the kernel regime.
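A minimal numerical sketch of the spectral amplification claim, assuming (for illustration only) that the attention-induced kernel scales like the cube of the data Gram matrix $G$; the exact Gram-induced kernel construction is not specified in this abstract, and the matrix sizes and variable names below (`X`, `G`, `K_attn`) are hypothetical. For symmetric positive-definite $G$, the eigenvalues of $G^3$ are the cubes of those of $G$, so $\kappa(G^3) = \kappa(G)^3$; even modest ill-conditioning then makes the stated $\Omega(\kappa^6)$ width scale impractically large.

```python
import numpy as np

# Build a synthetic data Gram matrix G = X X^T (symmetric positive definite).
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 200))   # 50 examples, 200 features (hypothetical sizes)
G = X @ X.T

# Condition number of the raw Gram matrix.
kappa = np.linalg.cond(G)

# Illustrative "attention-amplified" kernel: cubing G cubes its condition number,
# since the eigenvalues of G^3 are the cubes of the eigenvalues of G.
K_attn = G @ G @ G
kappa_attn = np.linalg.cond(K_attn)

print(f"kappa(G)    = {kappa:.3e}")
print(f"kappa(G^3)  = {kappa_attn:.3e}")
print(f"kappa(G)**3 = {kappa**3:.3e}")   # matches kappa(G^3) up to numerical error

# Width scale implied by the abstract's m = Omega(kappa^6) bound (constants omitted).
print(f"implied width scale kappa**6 = {kappa**6:.3e}")
```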