Transformer-based language models rely on positional encoding (PE) to handle token order and support context length extrapolation. However, existing PE methods lack theoretical clarity and rely on limited evaluation metrics to substantiate their extrapolation claims. We propose the Bayesian Attention Mechanism (BAM), a theoretical framework that formulates positional encoding as a prior within a probabilistic model. BAM unifies existing methods (e.g., NoPE and ALiBi) and motivates a new Generalized Gaussian positional prior that substantially improves long-context generalization. Empirically, BAM enables accurate information retrieval at $500\times$ the training context length, outperforming the previous state of the art in long-context retrieval accuracy while maintaining comparable perplexity and introducing only a minimal number of additional parameters.
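To make the prior interpretation concrete, the following is a minimal sketch (in our own notation, not taken verbatim from the paper) of how a positional prior can enter softmax attention as an additive log-prior bias; the generalized-Gaussian parameterization with shape $p$ and scale $\sigma$ is an assumed illustrative form:
$$
\alpha_{ij} \;\propto\; \underbrace{\exp\!\left(\tfrac{q_i^\top k_j}{\sqrt{d}}\right)}_{\text{content likelihood}} \cdot \underbrace{\pi(j \mid i)}_{\text{positional prior}},
\qquad
\log \pi(j \mid i) \;=\; -\left(\tfrac{|i-j|}{\sigma}\right)^{p} + \text{const}.
$$
Under this reading, a flat prior ($\log \pi$ constant) recovers NoPE, and an exponential (Laplacian) prior with $p=1$ yields a linear distance penalty in the logits, consistent with ALiBi-style biases; the generalized Gaussian case interpolates between such shapes.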