Attention compares query and key vectors via a scalar product, $\mathbf{Q}^T\mathbf{K}$, followed by a softmax normalization. Classically, parallel/orthogonal/antiparallel queries and keys lead to large/intermediate/small attention weights. Here we study expressive attention (EA), which is based on $(\mathbf{Q}^T\mathbf{K})^2$, the squared dot product. In this case attention is enhanced when query and key are either parallel or antiparallel, and suppressed for orthogonal configurations. For a series of autoregressive prediction tasks, we find that EA performs at least as well as the standard mechanism, dot-product attention (DPA). As task complexity increases, EA outperforms DPA by growing margins, which also holds for multi-task settings. For a given model size, EA achieves 100\% performance on a range of complexity levels not accessible to DPA.
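The contrast between the two mechanisms can be illustrated with a minimal sketch. The snippet below computes softmax attention weights from raw scores $\mathbf{Q}^T\mathbf{K}$ (DPA) versus their squares (EA); the function name, the absence of the usual $1/\sqrt{d_k}$ scaling, and the toy vectors are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def attention_weights(Q, K, expressive=False):
    """Softmax attention weights for queries Q (m x d) and keys K (n x d).

    DPA uses scores Q @ K.T; EA squares each score, so parallel and
    antiparallel query/key pairs both receive large weights, while
    orthogonal pairs are suppressed.
    """
    scores = Q @ K.T
    if expressive:
        scores = scores ** 2  # EA: (Q^T K)^2
    # numerically stabilized softmax over the key axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

# One query against a parallel, an antiparallel, and an orthogonal key.
Q = np.array([[1.0, 0.0]])
K = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]])
dpa = attention_weights(Q, K)                    # weights: large / small / intermediate
ea = attention_weights(Q, K, expressive=True)    # weights: large / large / small
```

Under DPA the antiparallel key receives the smallest weight; under EA it ties with the parallel key, matching the enhancement of antiparallel configurations described above.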