We propose a simple modification to the conventional attention mechanism applied by Transformers: Instead of quantifying pairwise query-key similarity with scaled dot-products, we quantify it with the logarithms of scaled dot-products of exponentials. Attention becomes expressible as a composition of log-sums of exponentials that is linearizable, with a latent space of constant size, enabling sequential application with constant time and space complexity per token. We implement our modification, verify that it works in practice, and conclude that it is a promising alternative to conventional attention.
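To illustrate why this choice of similarity admits a constant-size state, here is a minimal sketch; it is not the authors' implementation. It assumes the causal (autoregressive) case and a scale that is a constant shared by all query-key pairs, so it cancels under softmax normalization and is omitted. It also works directly with exponentials rather than in log-space, which is simpler to read but less numerically robust for large inputs. The function names `pairwise_attention` and `recurrent_attention` and the state variables `S` and `z` are hypothetical, chosen for this example.

```python
import math

def pairwise_attention(Q, K, V):
    """Causal attention over all query-key pairs (quadratic in sequence length)."""
    out = []
    for t, q in enumerate(Q):
        # Similarity is log(sum_i exp(q_i) * exp(k_i)). Softmax of a logarithm is
        # plain normalization, so the weights are these sums divided by their total.
        sims = [sum(math.exp(qi + ki) for qi, ki in zip(q, K[j]))
                for j in range(t + 1)]
        total = sum(sims)
        out.append([sum(s * V[j][c] for j, s in enumerate(sims)) / total
                    for c in range(len(V[0]))])
    return out

def recurrent_attention(Q, K, V):
    """Same outputs, one token at a time, with a state whose size is fixed."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # running sum of exp(k) outer v
    z = [0.0] * d_k                        # running sum of exp(k)
    out = []
    for q, k, v in zip(Q, K, V):
        # Fold the new key and value into the state: O(d_k * d_v) work per token.
        for i in range(d_k):
            ek = math.exp(k[i])
            z[i] += ek
            for c in range(d_v):
                S[i][c] += ek * v[c]
        # Read out with the current query; no earlier token is revisited.
        eq = [math.exp(qi) for qi in q]
        denom = sum(eq[i] * z[i] for i in range(d_k))
        out.append([sum(eq[i] * S[i][c] for i in range(d_k)) / denom
                    for c in range(d_v)])
    return out
```

Both functions take lists of equal-length query, key, and value vectors and should agree up to floating-point rounding. The recurrent form stores only `S` and `z`, whose sizes depend on the feature dimensions and not on the number of tokens processed, which is the sense in which the cost per token is constant.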