Recent advances in long-context Large Language Models (LCLMs) have generated significant interest, especially in applications such as querying scientific research papers. However, their potential is often limited by inadequate context utilization. We identify the absence of long-range semantic dependencies in typical training data as a primary hindrance. To address this, we investigate the benefits of frequently incorporating related documents into training inputs. Using the inherent directory structure of code data as a source of training examples, we demonstrate improvements in perplexity, even for tasks unrelated to coding. Building on these findings, but with a broader focus, we introduce Structured Packing for Long Context (SPLiCe). SPLiCe is a method for creating training examples that uses a retrieval method to collate the most mutually relevant documents into a single training context. Our results indicate that SPLiCe enhances model performance and can be used to train large models to utilize long contexts better. We validate our results by training a $3$B-parameter model, showing both perplexity improvements and better long-context performance on downstream tasks.
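The idea of collating mutually relevant documents into one training context can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`bow`, `cosine`, `pack_related`), the parameters `max_tokens` and `top_k`, the whitespace tokenizer, and the bag-of-words cosine similarity standing in for a real retriever are all assumptions made for illustration.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector; a stand-in for a real retriever's representation."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pack_related(docs, max_tokens, top_k=2):
    """Greedily group related documents into contexts of at most max_tokens.

    Hypothetical procedure: start from an unused root document, then
    repeatedly retrieve the top_k most similar unused documents for each
    document already in the context, until the token budget is exhausted.
    Token counts are approximated by whitespace-separated words.
    """
    vecs = [bow(d) for d in docs]
    lengths = [len(d.split()) for d in docs]
    unused = set(range(len(docs)))
    contexts = []
    while unused:
        root = min(unused)  # deterministic choice of the next root
        unused.remove(root)
        context = [root]
        budget = max_tokens - lengths[root]
        frontier = [root]
        while frontier and budget > 0:
            current = frontier.pop(0)
            # Rank remaining documents by similarity to the current one.
            ranked = sorted(
                unused, key=lambda j: (-cosine(vecs[current], vecs[j]), j)
            )
            for j in ranked[:top_k]:
                if lengths[j] <= budget:
                    unused.remove(j)
                    context.append(j)
                    frontier.append(j)
                    budget -= lengths[j]
        contexts.append([docs[i] for i in context])
    return contexts


docs = [
    "cat sat mat",
    "stock market fell",
    "cat mat play",
    "market stock rally",
]
packed = pack_related(docs, max_tokens=6, top_k=1)
```

In this toy run, the two documents about cats end up in one context and the two about markets in another, in contrast to packing documents in their original (unrelated) order. A document longer than `max_tokens` is still emitted as its own context; truncation is left out of the sketch.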