We propose Next Concept Prediction (NCP), a generative pretraining paradigm built on top of Next Token Prediction (NTP). NCP predicts discrete concepts that span multiple tokens, forming a more challenging pretraining objective. Our model, ConceptLM, quantizes hidden states with vector quantization to construct a concept vocabulary. It leverages both NCP and NTP to drive parameter updates, and each generated concept guides the generation of the subsequent tokens. We train ConceptLM from scratch on Pythia and GPT-2 backbones at scales ranging from 70M to 1.5B parameters, using up to 300B training tokens. Results on 13 benchmarks show that NCP yields consistent performance gains over traditional token-level models. Furthermore, continual pretraining experiments on an 8B-parameter Llama model indicate that NCP can further improve an NTP-trained model. Our analysis suggests that NCP produces stronger language models by introducing a harder pretraining task, offering a promising path toward better language modeling.
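To make the core mechanism concrete, the sketch below illustrates the two ingredients named in the abstract: quantizing hidden states against a learned concept codebook (vector quantization) and combining a next-token loss with a next-concept loss. This is a minimal illustration, not the authors' implementation; the class names, tensor shapes, straight-through estimator, and the `ncp_weight` coefficient are all assumptions for the purpose of the example.

```python
# Minimal sketch of VQ-based concept prediction on top of next-token prediction.
# All names, shapes, and the loss weight are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConceptQuantizer(nn.Module):
    """Maps hidden states to their nearest entries in a concept vocabulary (VQ)."""

    def __init__(self, num_concepts: int, hidden_dim: int):
        super().__init__()
        self.codebook = nn.Embedding(num_concepts, hidden_dim)

    def forward(self, hidden: torch.Tensor):
        # hidden: (batch, seq_len, hidden_dim)
        dists = torch.cdist(hidden, self.codebook.weight)   # (batch, seq_len, num_concepts)
        concept_ids = dists.argmin(dim=-1)                   # discrete concept id per position
        quantized = self.codebook(concept_ids)
        # Straight-through estimator: gradients bypass the non-differentiable argmin.
        quantized = hidden + (quantized - hidden).detach()
        return concept_ids, quantized


def combined_loss(token_logits, token_targets, concept_logits, concept_targets, ncp_weight=0.5):
    """Joint objective: next-token prediction plus a weighted next-concept prediction term."""
    ntp = F.cross_entropy(token_logits.flatten(0, 1), token_targets.flatten())
    ncp = F.cross_entropy(concept_logits.flatten(0, 1), concept_targets.flatten())
    return ntp + ncp_weight * ncp
```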