Recent advances in long-context large language models have attracted significant attention, yet their practical applications often suffer from suboptimal context utilization. This study investigates structuring training data to enhance semantic interdependence, demonstrating that this approach effectively improves context utilization. To this end, we introduce the Structured Packing for Long Context (SPLiCe) method, which uses retrieval to collate mutually relevant documents into long, coherent training examples. We validate SPLiCe empirically across models of varying sizes -- 3B, 7B, and 13B -- achieving improved performance on long-context tasks such as Qasper and HotpotQA. Remarkably, even brief fine-tuning with SPLiCe is sufficient to realize these benefits. Additionally, SPLiCe effectively mitigates the lost-in-the-middle phenomenon often observed in large models. Our comprehensive analysis of SPLiCe explores its design choices and reveals intriguing transfer effects; for instance, training on programming code enhances performance on natural-language tasks.
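The core packing idea described above can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's implementation: a toy Jaccard token-overlap score stands in for the retrieval index, and documents are chained greedily by similarity until a token budget is reached.

```python
# Hypothetical sketch of retrieval-based structured packing: starting
# from a seed document, greedily append the most similar unused
# document until the context budget is filled, so that one training
# example contains mutually relevant texts. The similarity function
# here is a toy stand-in for the retrieval used in the actual method.

def similarity(a: str, b: str) -> float:
    """Jaccard overlap of whitespace tokens (toy retrieval score)."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def splice_pack(docs: list[str], budget: int) -> str:
    """Greedily chain related documents into one long training example."""
    remaining = list(docs)
    example = [remaining.pop(0)]            # first doc acts as the seed
    length = len(example[0].split())
    while remaining and length < budget:
        last = example[-1]                  # retrieve relative to last doc
        best = max(remaining, key=lambda d: similarity(last, d))
        remaining.remove(best)
        example.append(best)
        length += len(best.split())
    return " ".join(example)

docs = [
    "binary search tree insert delete",
    "sorting quicksort mergesort arrays",
    "search tree balanced insert rotate",
    "cooking pasta tomato sauce",
]
packed = splice_pack(docs, budget=12)
# The doc about balanced search trees is placed right after the seed,
# while the unrelated cooking doc is pushed out by the budget.
```

The key design choice this sketch mirrors is chaining by pairwise relevance rather than packing random documents, which is what induces long-range semantic dependence within a single training example.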