While attention has been shown empirically to improve model performance, this success lacks a rigorous mathematical justification. This short paper establishes a novel connection between attention mechanisms and multinomial regression. Specifically, we show that in a fixed multinomial regression setting, optimizing over latent features yields solutions that align with the dynamics induced on those features by attention blocks. In other words, the evolution of representations through a transformer can be interpreted as a trajectory that recovers the optimal features for classification.
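To make the claimed correspondence concrete, the following is a minimal sketch of the setting; the notation (classifier weights $w_k$ and biases $b_k$, label $y$, number of classes $K$, and $\mathrm{Attn}$ for the map an attention block applies to a token's representation, suppressing the other tokens in the sequence) is illustrative and not necessarily the paper's own.

% Fixed multinomial (softmax) regression: the classifier parameters are held
% fixed, and the optimization variable is the latent feature x itself.
\[
  \mathcal{L}(x) \;=\; -\log \frac{\exp\big(w_y^{\top} x + b_y\big)}{\sum_{k=1}^{K} \exp\big(w_k^{\top} x + b_k\big)}
\]
% Gradient flow on the latent feature then traces a trajectory toward the
% optimal feature for classification:
\[
  \dot{x}(t) \;=\; -\nabla_x\, \mathcal{L}\big(x(t)\big)
\]
% The abstract's claim is that this trajectory aligns with the residual update
% applied by an attention block, with depth \ell playing the role of
% discretized time:
\[
  x^{(\ell+1)} \;=\; x^{(\ell)} + \mathrm{Attn}\big(x^{(\ell)}\big)
\]

Under this reading, each attention layer performs one approximate descent step on the classification loss, which is the sense in which the representation dynamics of a transformer "recover the optimal features."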