The Transformer architecture, a cornerstone of modern Large Language Models (LLMs), has achieved extraordinary success in sequence modeling, primarily due to its attention mechanism. Despite this power, however, the standard attention mechanism is plagued by two well-documented issues: representational collapse and attention sink. Although prior work has proposed remedies for these issues, they are typically studied in isolation, obscuring their deeper connection. In this paper, we present a unified perspective, arguing that both can be traced to a common root: improper attention allocation. We identify two failure modes: 1) Attention Overload, where multiple tokens receive comparably high weights, blurring semantic distinctions and leading to representational collapse; 2) Attention Underload, where no token is semantically relevant, yet attention mass must still be distributed, producing spurious focus such as attention sink. Building on this insight, we introduce Lazy Attention, a novel mechanism designed for a more focused attention distribution. To mitigate overload, it employs positional discrimination across both heads and dimensions to sharpen token distinctions. To counteract underload, it incorporates Elastic-Softmax, a modified normalization function that relaxes the standard softmax constraint to suppress attention on irrelevant tokens. Experiments on the FineWeb-Edu corpus, evaluated across nine diverse benchmarks, demonstrate that Lazy Attention successfully mitigates attention sink and achieves competitive performance against both standard attention and modern architectures, while reaching up to 59.58% attention sparsity.
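The abstract does not specify the form of Elastic-Softmax; as a purely illustrative sketch of the underload failure mode and of what "relaxing the softmax constraint" could look like, one common way to let attention be withheld is to add a slack term to the softmax denominator so the weights no longer have to sum to one. The function name `relaxed_softmax` and the `slack` parameter below are hypothetical stand-ins, not the paper's definition.

```python
import numpy as np

def standard_softmax(scores):
    # Standard softmax: weights are forced to sum to 1,
    # even when every score is low (no semantically relevant token).
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def relaxed_softmax(scores, slack=1.0):
    # Hypothetical relaxed normalization in the spirit of Elastic-Softmax:
    # the `slack` term in the denominator allows the weights to sum to
    # less than 1, so attention mass can shrink toward zero on irrelevant
    # tokens. (Illustrative only; the paper's exact formulation may differ.)
    e = np.exp(scores)
    return e / (slack + e.sum(axis=-1, keepdims=True))

# "Underload": all keys score low, i.e., no token is relevant.
low_scores = np.array([-4.0, -4.2, -3.9, -4.1])
print(standard_softmax(low_scores))  # still sums to 1 -> spurious focus
print(relaxed_softmax(low_scores))   # near-zero weights -> attention suppressed
```

With uniformly low scores, the standard softmax still distributes a full unit of attention (the behavior the abstract links to attention sink), whereas the relaxed variant assigns near-zero weight everywhere, which is the kind of sparse, "lazy" allocation the abstract describes.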