The time complexity of the standard attention mechanism in transformers scales quadratically with sequence length. We propose a probabilistic framework for attention based on a latent variable model, from which we derive a novel low-rank linear re-parameterisation of both bidirectional and causal attention. Our method can be seamlessly integrated as a drop-in replacement for the standard attention mechanism. Additionally, the framework provides a natural way to combine local standard attention with our global linear attention, which allows us to extend the context length of existing large pre-trained models with only a few additional training steps. The resulting ``Latte Transformer'' achieves performance comparable to standard attention and other state-of-the-art models, while maintaining linear time and memory complexity, along with constant-time next-token prediction during inference.
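The claims of linear time and constant-time next-token prediction rest on a general property of kernelised attention: once the attention weights factor through a finite set of features (here, latent states), the causal case can be computed with running sums instead of a quadratic score matrix. The sketch below illustrates this mechanism only; the function name, the `exp` feature map, and the latent dimension `L` are illustrative assumptions, not the paper's exact parameterisation.

```python
import numpy as np

def causal_linear_attention(Q, K, V):
    """Causal kernelised attention in O(T * L * D) time via running sums.

    Q, K: (T, L) pre-activation features, V: (T, D) values.
    Illustrative sketch, not the paper's exact low-rank parameterisation.
    """
    T, L = Q.shape
    D = V.shape[1]
    phi_q = np.exp(Q)        # non-negative query features
    phi_k = np.exp(K)        # non-negative key features
    S = np.zeros((L, D))     # running sum of outer(phi_k[t], V[t])
    z = np.zeros(L)          # running normaliser: sum of phi_k[t]
    out = np.zeros((T, D))
    for t in range(T):
        # Each step updates fixed-size state (S, z) in O(L * D),
        # which is what makes next-token prediction constant-time.
        S += np.outer(phi_k[t], V[t])
        z += phi_k[t]
        out[t] = (phi_q[t] @ S) / (phi_q[t] @ z)
    return out
```

Because the state `(S, z)` has a fixed size independent of the sequence length, generating one more token costs the same regardless of how long the context already is, in contrast to standard attention, whose per-token cost grows with the context.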