Transformer-based architectures traditionally employ softmax to compute attention weights, which produces dense distributions over all tokens in a sequence. While effective in many settings, this density has been shown to be detrimental for tasks that demand precise focus on fixed-size patterns: as sequence length increases, non-informative tokens accumulate attention probability mass, leading to dispersion and representational collapse. We show that dynamically sparse attention mechanisms using $α$-entmax can avoid these issues, due to their ability to assign exact zeros to irrelevant tokens. Furthermore, we introduce Adaptive-Scalable Entmax (ASEntmax), which endows $α$-entmax with a learnable temperature parameter, allowing the attention distribution to interpolate between sparse (pattern-focused) and dense (softmax-like) regimes. Our empirical evaluation on synthetic tasks and language modeling demonstrates that ASEntmax substantially outperforms softmax, scalable softmax, and fixed-temperature $α$-entmax baselines. On synthetic benchmarks it achieves up to 1000$\times$ length extrapolation; on language modeling it shows superior long-context generalization while preserving short-context performance, including better perplexity trends and higher retrieval accuracies at 8$\times$ the training length.
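The dispersion effect and the exact-zero property described above can be illustrated with a small numerical sketch. It is not the authors' implementation: `entmax_bisect` is a plain NumPy bisection for $α$-entmax on a single score vector, `target_weight` is a hypothetical helper, and the fixed `temperature` value merely stands in for the learnable temperature, whose actual parameterization in ASEntmax is not specified in this abstract. The toy setup (one informative token with score 3, all other scores 0) is likewise an assumption chosen for illustration.

```python
import numpy as np

def softmax(z):
    """Standard softmax over a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def entmax_bisect(z, alpha=1.5, n_iter=60):
    """alpha-entmax (alpha > 1) for a 1-D score vector.

    Finds the threshold tau such that sum_i [(alpha-1) z_i - tau]_+^(1/(alpha-1)) = 1
    by bisection. Illustrative sketch, not an optimized implementation.
    """
    zs = (alpha - 1.0) * z
    power = 1.0 / (alpha - 1.0)
    lo, hi = zs.max() - 1.0, zs.max()  # total mass >= 1 at lo, mass = 0 at hi
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        mass = np.sum(np.clip(zs - tau, 0.0, None) ** power)
        if mass >= 1.0:
            lo = tau
        else:
            hi = tau
    p = np.clip(zs - lo, 0.0, None) ** power
    return p / p.sum()  # renormalize to absorb the residual bisection error

def target_weight(attn_fn, n_tokens, target_score=3.0):
    """Attention weight on one informative token among zero-score distractors (toy setup)."""
    z = np.zeros(n_tokens)
    z[0] = target_score
    return attn_fn(z)[0]

# Softmax weight on the informative token shrinks as the sequence grows,
# whereas 1.5-entmax keeps the distractors at exactly zero in this setup.
for n in (10, 1_000, 100_000):
    print(n, target_weight(softmax, n), target_weight(entmax_bisect, n))

# A larger temperature (hypothetical fixed value; learnable in ASEntmax)
# flattens the scores and moves the same scores into the dense regime.
temperature = 4.0
z = np.zeros(9)
z[0] = 3.0
print(entmax_bisect(z / temperature))
```

In this toy setting the softmax weight on the informative token is $e^{3}/(e^{3}+n-1)$ and so decays with the sequence length $n$, while the sparse mapping is unaffected by the number of distractors as long as their scores fall below the threshold. Dividing the scores by a larger temperature lowers that threshold relative to the scores, which is the mechanism by which a temperature parameter can trade sparsity for density.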