Generalizing to longer sentences is important for recent Transformer-based language models. Besides algorithms that manipulate explicit position features, the success of Transformers without position encodings (NoPE) offers another way to address this challenge. In this paper, we study the length generalization property of NoPE. We find that although NoPE can extend to longer sequences than the commonly used explicit position encodings, it still has a limited context length. We identify a connection between the failure of NoPE's generalization and the distraction of attention distributions. We propose a parameter-efficient tuning method that searches for the best temperature hyper-parameter of each attention head, which substantially expands NoPE's context size. Experiments on long-sequence language modeling, the synthetic passkey retrieval task, and real-world long-context tasks show that NoPE performs competitively with state-of-the-art length generalization algorithms. The source code is publicly accessible.
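To make the role of a head's temperature concrete, the following is a minimal sketch of causal attention with no position encodings, where a single factor rescales the logits of one head. It is an illustration under stated assumptions, not the paper's implementation: the function names, the `scale` parameter (an inverse temperature multiplying the usual 1/sqrt(d) factor), and the use of entropy as a measure of how spread out the attention is are all hypothetical choices made here.

```python
import math


def causal_attention_weights(queries, keys, scale=1.0):
    """Causal softmax attention weights for one head, without position encodings.

    `scale` is a hypothetical per-head factor applied on top of 1/sqrt(d);
    it acts as an inverse temperature on the attention logits.
    """
    d = len(queries[0])
    weights = []
    for i, q in enumerate(queries):
        # Token i may only attend to positions 0..i (causal mask).
        logits = [
            scale * sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            for k in keys[: i + 1]
        ]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def entropy(dist):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def last_row_entropy(queries, keys, scale):
    """Entropy of the attention distribution of the final token."""
    return entropy(causal_attention_weights(queries, keys, scale)[-1])


if __name__ == "__main__":
    # Toy vectors, assumed purely for illustration.
    tokens = [[math.sin(j), math.cos(j)] for j in range(8)]
    for s in (0.5, 1.0, 2.0):
        print(s, last_row_entropy(tokens, tokens, s))
```

In this sketch, a larger `scale` sharpens the attention of the last token and a smaller one flattens it. A search of the kind described above would pick one such value per head; the selection criterion is left out here because the abstract does not specify it.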