Recent developments in long-context large language models have attracted considerable attention. Yet, their real-world applications are often hindered by ineffective use of context information. This work shows that structuring training data to increase semantic interdependence is an effective strategy for improving context utilization. To this end, we introduce Structured Packing for Long Context (SPLiCe), a method for creating training examples that uses information retrieval to collate mutually relevant documents into a single training context. We empirically validate SPLiCe on large $3$B and $7$B models, showing perplexity improvements and better long-context utilization on downstream tasks. Remarkably, even relatively short fine-tuning with SPLiCe is enough to attain these benefits. Additionally, a comprehensive study of SPLiCe reveals intriguing transfer effects, such as training on code data leading to perplexity improvements on text data.
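To make the packing idea concrete, below is a minimal sketch, not the authors' implementation, of retrieval-based structured packing: each training example starts from a seed document and greedily appends its nearest neighbours until a context-length budget is filled. The TF-IDF cosine similarity measure, the whitespace token count, and the `max_tokens` budget are illustrative assumptions rather than details taken from the paper.

```python
# Illustrative sketch of retrieval-based structured packing (assumed details,
# not the SPLiCe reference implementation).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np


def pack_contexts(documents, max_tokens=2048):
    """Collate mutually relevant documents into long training contexts."""
    vectors = TfidfVectorizer().fit_transform(documents)
    sims = cosine_similarity(vectors)          # pairwise document similarity
    unused = set(range(len(documents)))
    contexts = []
    while unused:
        seed = unused.pop()
        example = [documents[seed]]
        length = len(documents[seed].split())  # crude whitespace token count
        # Append the most similar still-unused documents to the seed
        # until the context budget is exhausted.
        for j in np.argsort(-sims[seed]):
            if j in unused and length + len(documents[j].split()) <= max_tokens:
                unused.remove(j)
                example.append(documents[j])
                length += len(documents[j].split())
        contexts.append("\n\n".join(example))
    return contexts


if __name__ == "__main__":
    docs = [
        "binary search tree insertion and deletion",
        "in-order traversal of a binary tree",
        "gradient descent and step size selection",
        "learning rate schedules for training",
    ]
    for ctx in pack_contexts(docs, max_tokens=32):
        print(ctx, "\n---")
```

In this toy run the two tree-related snippets and the two optimization-related snippets end up grouped in the same training contexts, which is the semantic interdependence the abstract refers to; a production pipeline would use a stronger retriever and real token counts.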