One of the key challenges in Transformer architectures is the quadratic complexity of the attention mechanism, which limits the efficient processing of long sequences. Many recent works have attempted to reduce the $O(n^2)$ time complexity of attention to semi-linear complexity. However, maintaining high performance while reducing complexity remains an open problem. One of the important lines of work in this respect is the Perceiver class of architectures, which has demonstrated excellent performance while reducing computational complexity. In this paper, we use the PerceiverAR as a basis and explore the design space of trade-offs between preserving context and reducing attention complexity. To this end, we develop four new architectural paradigms, the best performing of which we denote as the Efficient Context propagating Perceiver (ECP). ECP has two major advantages over the PerceiverAR. First, the ECP architecture overcomes the main drawback of PerceiverAR by utilizing both the context and the latent sequences in autoregressive training. Second, the ECP architecture operates with the same attention complexity as LongLoRA, making it computationally efficient. More importantly, via pairwise segment attention, it extracts richer information, resulting in improved language modeling. Empirically, we demonstrate that the ECP architecture significantly outperforms other state-of-the-art Transformer models on Wikitext-103, PG-19, and sCIFAR-10.
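To make the complexity trade-off concrete, the sketch below contrasts standard full attention, whose $n \times n$ score matrix drives the $O(n^2)$ cost, with attention restricted to non-overlapping segments, which drops the cost to $O(n \cdot s)$ for segment length $s$. This is a generic illustration of segment-local attention, not the paper's ECP architecture or its pairwise segment attention; all function names are hypothetical.

```python
import numpy as np

def full_attention(q, k, v):
    # Standard scaled dot-product attention: materializing the
    # (n, n) score matrix is what makes the cost quadratic in n.
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                     # (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ v

def segment_attention(q, k, v, seg_len):
    # Attention confined to non-overlapping segments: each token attends
    # only within its own segment, so cost falls to O(n * seg_len).
    n, _ = q.shape
    out = np.empty_like(v)
    for start in range(0, n, seg_len):
        sl = slice(start, start + seg_len)
        out[sl] = full_attention(q[sl], k[sl], v[sl])
    return out

rng = np.random.default_rng(0)
n, d, seg = 16, 8, 4
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
print(full_attention(q, k, v).shape)         # (16, 8)
print(segment_attention(q, k, v, seg).shape) # (16, 8)
```

Within each segment the two functions agree exactly; what segment-local attention gives up is cross-segment context, which is precisely the information that context-propagating designs aim to preserve.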