Attention mechanisms, particularly softmax attention, have been instrumental in the success of transformer-based models such as GPT. However, the quadratic memory complexity of softmax attention with respect to sequence length poses significant challenges for processing longer sequences. We introduce Cottention, a novel attention mechanism that replaces the softmax operation with cosine similarity. By leveraging the properties of cosine similarity and rearranging the attention equation, Cottention achieves native linear memory complexity with respect to sequence length, making it inherently more memory-efficient than softmax attention. We demonstrate that Cottention can be reformulated as a recurrent neural network (RNN) with a finite hidden state, allowing for constant memory usage during inference. We evaluate Cottention on both the bidirectional BERT and causal GPT tasks, demonstrating comparable performance to softmax attention while significantly reducing memory requirements. To ensure efficient computation, we develop a custom CUDA kernel for Cottention. Our results show that Cottention is a promising alternative to softmax attention, enabling the processing of longer sequences without sacrificing performance, due to its native linear memory complexity and ability to maintain a constant memory footprint during inference.
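The two memory claims in the abstract follow from associativity of matrix products once the softmax is gone: with L2-normalized queries and keys, attention can be computed as Q(KᵀV) instead of (QKᵀ)V, and the causal case becomes a recurrence over a fixed-size state. The sketch below illustrates this mechanically; function names, the epsilon constant, and the absence of any output scaling are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def cosine_normalize(X, eps=1e-6):
    """L2-normalize rows so dot products become cosine similarities (eps is an assumed guard)."""
    return X / (np.linalg.norm(X, axis=-1, keepdims=True) + eps)

def cottention_parallel(Q, K, V):
    """Bidirectional form: compute the (d x d_v) summary K^T V first,
    so the (n x n) attention matrix is never materialized.
    Memory is O(n*d + d*d_v) instead of O(n^2)."""
    Qn, Kn = cosine_normalize(Q), cosine_normalize(K)
    return Qn @ (Kn.T @ V)

def cottention_recurrent(Q, K, V):
    """Causal form as an RNN: a running (d x d_v) hidden state accumulates
    outer products k_t v_t^T, so each inference step uses constant memory."""
    Qn, Kn = cosine_normalize(Q), cosine_normalize(K)
    state = np.zeros((Q.shape[-1], V.shape[-1]))
    out = np.empty_like(V)
    for t in range(Q.shape[0]):
        state += np.outer(Kn[t], V[t])   # finite hidden-state update
        out[t] = Qn[t] @ state           # attends to all positions <= t
    return out
```

By associativity, `cottention_parallel` is exactly `(Qn @ Kn.T) @ V`, and `cottention_recurrent` matches the causally masked version, which is why the recurrent view gives identical results with a constant-size state.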