Recent developments in long-context large language models have attracted considerable attention. Yet their real-world applications are often hindered by ineffective use of the information available in the context. This work shows that structuring training data to increase semantic interdependence is an effective strategy for improving context utilization. To this end, we introduce Structured Packing for Long Context (SPLiCe), a method that creates training examples by using information retrieval to collate mutually relevant documents into a single training context. We empirically validate SPLiCe on large $3$B and $7$B models, showing perplexity improvements and better long-context utilization on downstream tasks. Remarkably, even a relatively short fine-tuning run with SPLiCe is enough to attain these benefits. Additionally, our comprehensive study of SPLiCe reveals intriguing transfer effects, such as training on code data leading to perplexity improvements on text data.
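To make the idea of collating mutually relevant documents concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the retriever, the traversal order, or the stopping rule, so everything here is an assumption: a bag-of-words cosine similarity stands in for a real retriever, documents are gathered breadth-first from a root document, and packing stops when a word-count budget is exhausted. The names `pack_context`, `k`, and `max_tokens` are hypothetical.

```python
import math
import re
from collections import Counter, deque


def tokenize(text):
    # Crude lowercase word tokenizer; a real pipeline would use the model's tokenizer.
    return re.findall(r"[a-z0-9_]+", text.lower())


def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pack_context(docs, root, k=2, max_tokens=64):
    """Return indices of documents to concatenate into one training context.

    Assumed procedure (illustrative only): start from `root`, repeatedly
    retrieve up to `k` unused documents most similar to the current one,
    and stop once the word budget `max_tokens` is used up.
    """
    vecs = [Counter(tokenize(d)) for d in docs]
    lengths = [sum(v.values()) for v in vecs]
    used = {root}
    order = [root]
    budget = max_tokens - lengths[root]
    queue = deque([root])
    while queue and budget > 0:
        current = queue.popleft()
        # Score every unused document against the current one; drop unrelated ones.
        scored = [(cosine(vecs[current], vecs[i]), i)
                  for i in range(len(docs)) if i not in used]
        scored = [(s, i) for s, i in scored if s > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        for _, i in scored[:k]:
            if lengths[i] > budget:
                continue  # skip documents that no longer fit
            used.add(i)
            order.append(i)
            queue.append(i)
            budget -= lengths[i]
    return order


docs = [
    "def parse json file",
    "json parse loader error",
    "banana bread recipe flour",
    "loader error handling retry",
]
packed = pack_context(docs, root=0, k=1, max_tokens=100)
context = "\n\n".join(docs[i] for i in packed)
```

In this toy corpus the unrelated recipe document is never pulled into the context, while the two documents that share vocabulary with the root, directly or through an intermediate document, are. A random-packing baseline would instead concatenate documents without regard to relatedness.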