Next-token prediction (NTP) is the cornerstone of modern large language model (LLM) pretraining, driving their unprecedented capabilities in text generation, reasoning, and instruction following. However, purely token-level prediction limits the model's capacity to capture higher-level semantic structures and long-range contextual relationships. To overcome this limitation, we introduce \textbf{ContextLM}, a framework that augments standard pretraining with an inherent \textbf{next-context prediction} objective. This mechanism trains the model to learn predictive representations of multi-token contexts, leveraging error signals derived from future token chunks. Crucially, ContextLM achieves this enhancement while remaining fully compatible with the standard autoregressive, token-by-token evaluation paradigm (e.g., perplexity). Extensive experiments on the GPT-2 and Pythia model families, scaled up to $1.5$B parameters, show that ContextLM delivers consistent improvements in both perplexity and downstream task performance. Our analysis indicates that next-context prediction provides a scalable and efficient pathway to stronger language modeling, yielding better long-range coherence and more effective attention allocation with minimal computational overhead.
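To make the objective concrete, the following is a minimal PyTorch-style sketch of how an auxiliary next-context prediction loss could be combined with standard NTP. The chunk size, mean-pooled chunk representation, linear prediction head, and loss weight \texttt{alpha} are illustrative assumptions, not ContextLM's published recipe; the auxiliary term applies only at training time, so token-by-token evaluation is unchanged.
\begin{verbatim}
# Minimal sketch (assumptions: chunking scheme, pooled chunk targets,
# linear head, and loss weight are illustrative, not the paper's recipe).
import torch
import torch.nn as nn
import torch.nn.functional as F

class NextContextHead(nn.Module):
    """Predicts a representation of the next multi-token chunk."""
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        return self.proj(x)

def context_lm_loss(hidden, logits, targets, head, chunk_size=4, alpha=0.1):
    """
    hidden:  (B, T, D) final hidden states
    logits:  (B, T, V) next-token logits
    targets: (B, T)    token ids
    """
    # Standard next-token prediction cross-entropy (unchanged).
    ntp = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        targets[:, 1:].reshape(-1),
    )

    # Auxiliary next-context prediction: summarize each future chunk of
    # hidden states and predict it from the current chunk.
    B, T, D = hidden.shape
    T_trim = (T // chunk_size) * chunk_size
    chunks = hidden[:, :T_trim].reshape(B, -1, chunk_size, D).mean(dim=2)
    pred = head(chunks[:, :-1])        # prediction from chunk c
    target = chunks[:, 1:].detach()    # error signal from future chunk c+1
    ctx = F.mse_loss(pred, target)

    return ntp + alpha * ctx
\end{verbatim}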