Attention mechanisms have revolutionized several domains of artificial intelligence, such as natural language processing and computer vision, by enabling models to selectively focus on relevant parts of the input data. While recent work has characterized the optimization dynamics of gradient descent (GD) in attention-based models and the structural properties of its preferred solutions, less is known about more general optimization algorithms such as mirror descent (MD). In this paper, we investigate the convergence properties and implicit biases of a family of MD algorithms tailored for softmax attention mechanisms, with the potential function chosen as the $p$-th power of the $\ell_p$-norm. Specifically, we show that these algorithms converge in direction to a generalized hard-margin SVM with an $\ell_p$-norm objective when applied to a classification problem using a softmax attention model. Notably, our theoretical results reveal that the convergence rate is comparable to that of traditional GD in simpler models, despite the highly nonlinear and nonconvex nature of the present problem. Additionally, we delve into the joint optimization dynamics of the key-query matrix and the decoder, establishing conditions under which both sets of parameters converge in direction to their respective hard-margin SVM solutions. Lastly, our numerical experiments on real data demonstrate that MD algorithms improve generalization over standard GD and are more effective at selecting the optimal tokens.
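For concreteness, the family of updates studied here can be sketched as follows (a minimal illustration assuming the standard mirror descent formulation; the symbols $w_t$, $\eta$, and $\mathcal{L}$ are notation introduced for exposition only):
\[
\nabla \psi(w_{t+1}) \;=\; \nabla \psi(w_t) \;-\; \eta\, \nabla \mathcal{L}(w_t),
\qquad
\psi(w) = \|w\|_p^p,
\qquad
\big[\nabla \psi(w)\big]_i = p\,\operatorname{sign}(w_i)\,|w_i|^{p-1},
\]
where $\mathcal{L}$ denotes the empirical training loss of the softmax attention model, $\eta>0$ is the step size, and $p>1$; up to a rescaling of the step size, $p=2$ recovers standard GD.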