Recent advances in long-context Large Language Models (LCLMs) have generated significant interest, especially in applications such as querying scientific research papers. However, their potential is often limited by inadequate context utilization. We identify the absence of long-range semantic dependencies in typical training data as a primary hindrance. To address this, we investigate the benefits of frequently incorporating related documents into training inputs. Using the inherent directory structure of code data as a source of training examples, we demonstrate improvements in perplexity, even for tasks unrelated to coding. Building on these findings, but with a broader focus, we introduce Structured Packing for Long Context (SPLiCe), a method for creating training examples by using a retrieval method to collate the most mutually relevant documents into a single training context. Our results indicate that SPLiCe enhances model performance and can be used to train large models to utilize long contexts better. We validate our results by training a $3$B-parameter model, showing both perplexity improvements and better long-context performance on downstream tasks.
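The idea of collating mutually relevant documents into a single training context can be illustrated with a minimal sketch. This is not the SPLiCe implementation: the abstract specifies only that "a retrieval method" is used, so the sketch assumes a Jaccard overlap of whitespace tokens as a stand-in retriever, a greedy selection order, and a word-count token budget. The names `tokenize`, `jaccard`, `pack_related`, and the toy corpus `docs` are all hypothetical.

```python
from typing import Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercased whitespace tokens; a crude stand-in for a real retriever's features."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pack_related(docs: Dict[str, str], seed: str, max_tokens: int) -> List[str]:
    """Greedily build one training context, starting from the document `seed`.

    Assumption: at each step we add the unused document most similar to the
    tokens already in the context, and stop once the next pick would exceed
    the budget (measured here in whitespace-separated words).
    """
    context = [seed]
    context_tokens = tokenize(docs[seed])
    used_len = len(docs[seed].split())
    remaining = set(docs) - {seed}

    while remaining:
        # sorted() makes tie-breaking deterministic (alphabetical by document id).
        best = max(
            sorted(remaining),
            key=lambda d: jaccard(context_tokens, tokenize(docs[d])),
        )
        doc_len = len(docs[best].split())
        if used_len + doc_len > max_tokens:
            break
        context.append(best)
        context_tokens |= tokenize(docs[best])
        used_len += doc_len
        remaining.remove(best)
    return context


# Toy corpus (hypothetical): three related documents and one unrelated one.
docs = {
    "a": "attention layers in transformer models",
    "b": "transformer models use attention over long context",
    "c": "recipe for tomato soup with basil",
    "d": "long context training for transformer models",
}

print(pack_related(docs, "a", max_tokens=20))  # ['a', 'b', 'd']
```

In this toy run the unrelated document `c` is left out of the context, while the three documents that share vocabulary are packed together. A practical pipeline would presumably replace the Jaccard score with a stronger retriever and count tokens with the model's tokenizer.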