An ideal length-extrapolatable Transformer language model can handle sequences longer than the training length without any fine-tuning. Such long-context utilization capability relies heavily on a flexible positional embedding design. Upon investigating the flexibility of existing large pre-trained Transformer language models, we find that the T5 family deserves a closer look, as its positional embeddings capture rich and flexible attention patterns. However, T5 suffers from the dispersed attention issue: the longer the input sequence, the flatter the attention distribution. To alleviate this issue, we propose two attention alignment strategies based on temperature scaling. Our findings show that these strategies improve the long-context utilization capability of T5 on language modeling, retrieval, multi-document question answering, and code completion tasks without any fine-tuning. This suggests that a flexible positional embedding design and attention alignment can go a long way toward Transformer length extrapolation.
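The dispersed attention issue and the effect of temperature scaling can be illustrated with a toy sketch. This is not the procedure proposed in the paper: random Gaussian logits stand in for real attention scores, `find_temperature` is a hypothetical helper, and matching the entropy of a shorter sequence is just one plausible alignment target assumed here for illustration.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Softmax over a 1-D array of attention logits with a temperature."""
    z = logits / temperature
    z = z - z.max()  # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()


def entropy(p):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return float(-(p * np.log(p + 1e-12)).sum())


def find_temperature(logits, target_entropy, lo=1e-3, hi=1.0, steps=50):
    """Hypothetical helper: bisect for a temperature in [lo, hi] whose
    softmax entropy matches target_entropy. Relies on softmax entropy
    being non-decreasing in the temperature."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        if entropy(softmax(logits, mid)) > target_entropy:
            hi = mid  # still too flat, so lower the temperature
        else:
            lo = mid
    return (lo + hi) / 2


# Assumption: i.i.d. Gaussian logits as a stand-in for attention scores.
rng = np.random.default_rng(0)
short_logits = rng.normal(size=512)    # stand-in for the training length
long_logits = rng.normal(size=4096)    # stand-in for a longer input

h_short = entropy(softmax(short_logits))
h_long = entropy(softmax(long_logits))  # flatter: more positions share the mass

# A temperature below 1 sharpens the long-sequence distribution.
tau = find_temperature(long_logits, h_short)
h_aligned = entropy(softmax(long_logits, tau))

print(f"entropy at length 512:  {h_short:.3f}")
print(f"entropy at length 4096: {h_long:.3f}")
print(f"temperature {tau:.3f} gives entropy {h_aligned:.3f}")
```

In this sketch the longer sequence has the higher entropy at temperature 1, and a temperature below 1 brings its entropy back to the level of the shorter sequence. In a real model the temperature would be applied to the attention logits of each head before the softmax, and how it is chosen is exactly what an alignment strategy has to specify.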