The theoretical understanding of self-attention (SA) has been steadily progressing. A prominent line of work studies a class of SA layers that admit an energy function decreased by state updates. While this line of work provides valuable insights into inherent biases in signal propagation, it often relies on idealized assumptions or additional constraints that are not necessarily present in standard SA. Thus, to broaden our understanding, this work aims to relax these energy constraints and to provide an energy-agnostic characterization of inference dynamics through dynamical-systems analysis. More specifically, we first relax the symmetry and single-head constraints traditionally required by energy-based formulations. Next, we show that analyzing the Jacobian matrix of the state update is highly valuable when investigating more general SA architectures that do not necessarily admit an energy function. This analysis reveals that the normalization layer plays an essential role in suppressing both the Lipschitz constant of SA and the complex eigenvalues of the Jacobian, which correspond to the oscillatory components of the dynamics. Moreover, the Lyapunov exponents computed from these Jacobians demonstrate that the normalized dynamics lie close to a critical state, and that this criticality serves as a strong indicator of high inference performance. Finally, the Jacobian perspective also enables us to develop regularization methods for training and a pseudo-energy for monitoring inference dynamics.
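As a concrete illustration of the Jacobian-based analysis described above, the following minimal sketch (not the paper's implementation) treats a single-head SA layer with a residual connection and LayerNorm as a discrete-time dynamical system, computes the Jacobian of one state update with `jax.jacfwd`, inspects the Jacobian's complex eigenvalues, and estimates leading Lyapunov exponents with the standard QR (Benettin-style) algorithm. All names, shapes, weights, and step counts (`Wq`, `Wk`, `Wv`, `N`, `D`, `K`, `T`) are hypothetical choices for the sketch.

```python
# Minimal sketch, assuming the update rule x <- LayerNorm(x + SA(x));
# all sizes and weights below are illustrative, not from the paper.
import jax
import jax.numpy as jnp

N, D = 8, 16   # tokens, embedding dim (hypothetical sizes)
K = 4          # number of Lyapunov exponents to estimate
T = 200        # iteration steps

key = jax.random.PRNGKey(0)
kq, kk, kv, kx, kq0 = jax.random.split(key, 5)
Wq = jax.random.normal(kq, (D, D)) / jnp.sqrt(D)
Wk = jax.random.normal(kk, (D, D)) / jnp.sqrt(D)
Wv = jax.random.normal(kv, (D, D)) / jnp.sqrt(D)

def layer_norm(x):
    # token-wise normalization: the "normalization layer" in the abstract
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True) + 1e-6
    return (x - mu) / sd

def step(x_flat):
    # one inference update, flattened to a vector so the Jacobian is a matrix
    x = x_flat.reshape(N, D)
    attn = jax.nn.softmax((x @ Wq) @ (x @ Wk).T / jnp.sqrt(D), axis=-1)
    x = layer_norm(x + attn @ (x @ Wv))
    return x.reshape(-1)

jac = jax.jacfwd(step)  # (N*D, N*D) Jacobian of the state update

# Complex eigenvalues of the Jacobian correspond to oscillatory components
# of the dynamics (the quantity normalization is said to suppress).
x = jax.random.normal(kx, (N * D,))
eigs = jnp.linalg.eigvals(jac(x))
print("max |Im(eig)|:", jnp.abs(eigs.imag).max())

# Lyapunov exponents via repeated QR re-orthonormalization of tangent vectors.
Q = jnp.linalg.qr(jax.random.normal(kq0, (N * D, K)))[0]
log_r = jnp.zeros(K)
for _ in range(T):
    Q, R = jnp.linalg.qr(jac(x) @ Q)   # propagate and re-orthonormalize
    log_r += jnp.log(jnp.abs(jnp.diag(R)))
    x = step(x)

print("leading Lyapunov exponents:", log_r / T)
```

In this picture, exponents clustered near zero would indicate dynamics close to the critical regime that the abstract links to high inference performance, while large positive exponents would indicate strongly expansive, chaotic propagation.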