Recently, many methods have been developed to extend the context length of pre-trained large language models (LLMs), but they often require fine-tuning at the target length ($\gg4K$) and struggle to effectively utilize information from the middle part of the context. To address these issues, we propose $\textbf{C}$ontinuity-$\textbf{R}$elativity ind$\textbf{E}$xing with g$\textbf{A}$ussian $\textbf{M}$iddle (CREAM), which interpolates positional encodings by manipulating position indices. Apart from being simple, CREAM is training-efficient: it only requires fine-tuning at the pre-trained context window (e.g., Llama 2-4K) and can extend LLMs to a much longer target context length (e.g., 256K). To ensure that the model focuses more on the information in the middle, we introduce a truncated Gaussian to encourage sampling from the middle part of the context during fine-tuning, thus alleviating the ``Lost-in-the-Middle'' problem faced by long-context LLMs. Experimental results show that CREAM successfully extends LLMs to the target length for both the Base and Chat versions of $\texttt{Llama2-7B}$ with ``Never Miss A Beat''. Our code will be publicly available soon.
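To illustrate the truncated-Gaussian idea, the sketch below samples a start index for a context segment so that segments near the middle of the context are drawn more often during fine-tuning. This is a minimal illustration under assumed parameters (the function name, `sigma_frac`, and the rejection-sampling parameterization are assumptions, not CREAM's exact recipe).

```python
import numpy as np

def sample_middle_start(context_len, window_len, sigma_frac=0.25, rng=None):
    """Sample a start index for a segment of `window_len` tokens from a
    `context_len`-token context, biased toward the middle via a truncated
    Gaussian (illustrative sketch; parameterization is an assumption)."""
    rng = rng or np.random.default_rng()
    lo, hi = 0, context_len - window_len   # valid range of start indices
    mu = (lo + hi) / 2                     # center the Gaussian on the middle
    sigma = sigma_frac * (hi - lo)         # spread as a fraction of the range
    while True:                            # rejection sampling = truncation
        s = rng.normal(mu, sigma)
        if lo <= s <= hi:
            return int(round(s))
```

Sampling from this distribution during fine-tuning makes middle-of-context segments appear more frequently than under uniform sampling, which is the intuition behind the ``Lost-in-the-Middle'' mitigation described above.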