Transformer-based large language models (LLMs) typically have a limited context window, resulting in significant performance degradation when processing text beyond the length of the context window. Numerous approaches have been proposed to extend the context window and achieve length extrapolation of LLMs, but there is still a lack of in-depth interpretation of these approaches. In this study, we explore the positional information within and beyond the context window to decipher the underlying mechanism of LLMs. Using a mean-based decomposition method, we disentangle positional vectors from the hidden states of LLMs and analyze their formation and their effect on attention. Furthermore, when texts exceed the context window, we analyze the changes in positional vectors under two settings, i.e., direct extrapolation and context window extension. Based on our findings, we design two training-free context window extension methods: positional vector replacement and attention window extension. Experimental results show that our methods can effectively extend the context window length.
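The mean-based decomposition mentioned above can be illustrated with a minimal sketch. The assumption here is that averaging hidden states at the same position across many different input texts cancels content-specific variation, leaving a per-position estimate of the positional vector; the function name and the use of random arrays in place of real LLM hidden states are illustrative, not from the paper.

```python
import numpy as np

def decompose_positional_vectors(hidden_states):
    """Mean-based decomposition sketch.

    hidden_states: array of shape (num_seqs, seq_len, hidden_dim),
    hidden states from one layer collected over many different texts.
    Returns the per-position mean (positional vectors) and the
    per-sequence residual (content-dependent component).
    """
    # Average over sequences: content varies, position is shared,
    # so the mean approximates the positional vector at each position.
    positional = hidden_states.mean(axis=0)           # (seq_len, hidden_dim)
    # Residual after removing the positional component.
    semantic = hidden_states - positional[None, ...]  # (num_seqs, seq_len, hidden_dim)
    return positional, semantic

# Toy demonstration with random stand-ins for hidden states.
rng = np.random.default_rng(0)
states = rng.normal(size=(8, 16, 32))  # 8 sequences, length 16, dim 32
pos, sem = decompose_positional_vectors(states)
```

By construction, the residual averages to zero at every position, so the two components sum back to the original hidden states exactly.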