E^2-LLM: Efficient and Extreme Length Extension of Large Language Models

Jiaheng Liu, Zhiqi Bai, Yuanxing Zhang, Chenchen Zhang, Yu Zhang, Ge Zhang, Jiakai Wang, Haoran Que, Yukang Chen, Wenbo Su, Tiezheng Ge, Jie Fu, Wenhu Chen, Bo Zheng

Typically, training LLMs with long context sizes is computationally expensive, requiring extensive training hours and GPU resources. Existing long-context extension methods usually need additional training procedures to support the corresponding long-context windows, which requires long-context training data (e.g., 32k) and incurs high GPU training costs. To address these issues, we propose an Efficient and Extreme length extension method for Large Language Models, called E^2-LLM, with only one training procedure and dramatically reduced computation cost, which also removes the need to collect long-context data. Concretely, first, the training data of our E^2-LLM only requires a short length (e.g., 4k), which greatly reduces the tuning cost. Second, the training procedure on the short training context window is performed only once, and we can support different evaluation context windows at inference. Third, in E^2-LLM, based on RoPE position embeddings, we introduce two different augmentation methods on the scale and position index parameters for different samples in training. This aims to make the model more robust to the different relative differences when directly interpolating to an arbitrary context length at inference. Comprehensive experimental results on multiple benchmark datasets demonstrate the effectiveness of our E^2-LLM on challenging long-context tasks.
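
The abstract names the two augmented quantities (a scale and a position index offset applied to RoPE) but not how they are sampled, so the following is only a minimal sketch under stated assumptions rather than the authors' implementation. It assumes position `p` is mapped to `(p + offset) / scale` before the rotary angles are computed, that the scale is drawn uniformly from integers, and that the offset is bounded so the scaled position stays inside the short training window. The function names `rope_angles`, `apply_rope`, and `sample_augmentation` are hypothetical.

```python
import math
import random


def rope_angles(seq_len, head_dim, scale=1.0, offset=0, base=10000.0):
    """Return rotary angles as a [seq_len][head_dim // 2] nested list.

    Position p is mapped to (p + offset) / scale, so scale > 1 squeezes
    positions together (interpolation) and offset shifts the index window.
    """
    inv_freq = [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]
    return [[(p + offset) / scale * f for f in inv_freq] for p in range(seq_len)]


def apply_rope(vec, angles):
    """Rotate each consecutive pair (vec[2i], vec[2i+1]) by angles[i]."""
    out = []
    for i, a in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * math.cos(a) - y * math.sin(a),
                x * math.sin(a) + y * math.cos(a)]
    return out


def sample_augmentation(train_len, max_scale, rng):
    """Assumed sampler: draw one (scale, offset) pair per training sample."""
    scale = rng.randint(1, max_scale)      # assumption: uniform integer scale
    max_offset = (scale - 1) * train_len   # keeps (p + offset) / scale < train_len
    offset = rng.randint(0, max_offset)    # assumption: uniform integer offset
    return scale, offset


def dot(u, v):
    """Plain dot product, standing in for one attention logit."""
    return sum(x * y for x, y in zip(u, v))
```

Under these assumptions, the attention logit between two positions depends only on their distance divided by the scale, not on the offset. Varying the scale across samples therefore exposes the model to different relative position densities while every sequence stays at the short training length.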

