Since the introduction of the Transformer architecture for large language models, the softmax-based attention layer has faced increasing scrutiny due to its quadratic-time computational complexity. Attempts have been made to replace it with lower-complexity methods, in most cases at the cost of reduced performance. We introduce Hierarchical Shift Mixing (HSM), a general framework for token mixing that distributes pairwise token interactions across Transformer layers rather than computing them densely within each layer. HSM achieves linear-time complexity while remaining agnostic to the specific mixing function. We show that even simple HSM variants achieve performance close to softmax attention, and that hybrid architectures combining HSM with softmax attention can outperform a GPT-style Transformer baseline while reducing computational cost during both training and inference.