Extending large language models to handle long contexts effectively requires instruction fine-tuning on input sequences of similar length. To address this, we present LongAlign, a recipe covering instruction data, training, and evaluation for long context alignment. First, we construct a long instruction-following dataset using Self-Instruct. To ensure data diversity, it covers a broad range of tasks from various long context sources. Second, we adopt packing and sorted batching strategies to speed up supervised fine-tuning on data with varied length distributions. Additionally, we develop a loss weighting method to balance the contribution of different sequences to the loss during packing training. Third, we introduce the LongBench-Chat benchmark for evaluating instruction-following capabilities on queries of 10k-100k in length. Experiments show that LongAlign outperforms existing recipes for LLMs in long context tasks by up to 30%, while also maintaining their proficiency in handling short, generic tasks. The code, data, and long-aligned models are open-sourced at https://github.com/THUDM/LongAlign.
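The training-side ideas named in the abstract (sorted batching, packing, and loss weighting) can be illustrated with a minimal sketch. This is not the LongAlign implementation: the function names, the greedy first-fit packing heuristic, and the specific weighting (each sequence contributes equally, regardless of its length or of which pack it lands in) are assumptions chosen for illustration. Here `token_losses[i]` is assumed to hold the per-token losses of sequence `i` on its target tokens.

```python
from typing import List


def sorted_batches(lengths: List[int], batch_size: int) -> List[List[int]]:
    """Sorted batching: group sequence indices so each batch holds similar lengths."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def pack_sequences(lengths: List[int], max_len: int) -> List[List[int]]:
    """Packing: greedy first-fit-decreasing assignment of sequences to packs
    whose total length does not exceed max_len (an illustrative heuristic)."""
    packs: List[List[int]] = []
    free: List[int] = []  # remaining capacity of each pack
    for i in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        if lengths[i] > max_len:
            raise ValueError(f"sequence {i} is longer than max_len")
        for p, capacity in enumerate(free):
            if lengths[i] <= capacity:
                packs[p].append(i)
                free[p] -= lengths[i]
                break
        else:  # no existing pack had room
            packs.append([i])
            free.append(max_len - lengths[i])
    return packs


def naive_pack_loss(token_losses: List[List[float]], packs: List[List[int]]) -> float:
    """Unweighted baseline: average over all tokens in a pack, then over packs."""
    per_pack = []
    for pack in packs:
        tokens = [loss for i in pack for loss in token_losses[i]]
        per_pack.append(sum(tokens) / len(tokens))
    return sum(per_pack) / len(per_pack)


def balanced_pack_loss(token_losses: List[List[float]], packs: List[List[int]]) -> float:
    """Assumed weighting: each token of sequence i gets weight 1 / (K * N_i),
    with K sequences in total and N_i tokens in sequence i, so the result
    equals the mean of the per-sequence mean losses."""
    num_seqs = sum(len(pack) for pack in packs)
    total = 0.0
    for pack in packs:
        for i in pack:
            total += sum(token_losses[i]) / (num_seqs * len(token_losses[i]))
    return total


# Toy example: three sequences with 2, 6, and 8 target tokens.
token_losses = [[1.0] * 2, [0.0] * 6, [0.5] * 8]
lengths = [len(t) for t in token_losses]
packs = pack_sequences(lengths, max_len=8)
naive = naive_pack_loss(token_losses, packs)
balanced = balanced_pack_loss(token_losses, packs)
```

In this toy example the short sequence shares a pack with a longer one, so under the naive average its tokens are diluted and it contributes less than the sequence that occupies a pack alone. The balanced variant gives every sequence the same say in the final loss, which is one way to read "balance the contribution to the loss across different sequences".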