Large language models such as GPT and Llama are trained with a next-token prediction loss. In this work, we suggest that training language models to predict multiple future tokens at once results in higher sample efficiency. More specifically, at each position in the training corpus, we ask the model to predict the following n tokens using n independent output heads, operating on top of a shared model trunk. Considering multi-token prediction as an auxiliary training task, we measure improved downstream capabilities with no overhead in training time for both code and natural language models. The method is increasingly useful for larger model sizes, and keeps its appeal when training for multiple epochs. Gains are especially pronounced on generative benchmarks like coding, where our models consistently outperform strong baselines by several percentage points. Our 13B parameter models solve 12% more problems on HumanEval and 17% more on MBPP than comparable next-token models. Experiments on small algorithmic tasks demonstrate that multi-token prediction is favorable for the development of induction heads and algorithmic reasoning capabilities. As an additional benefit, models trained with 4-token prediction are up to 3 times faster at inference, even with large batch sizes.
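The architecture described above — n independent output heads reading a shared trunk representation, with the losses summed — can be sketched in a few lines. The following is a minimal, pure-Python illustration, not the authors' implementation: the toy dimensions (`VOCAB`, `HIDDEN`, `N_HEADS`), the random weights, and the helper names are all assumptions made for clarity.

```python
import math
import random

random.seed(0)

VOCAB = 8    # toy vocabulary size (assumption)
HIDDEN = 4   # toy trunk width (assumption)
N_HEADS = 4  # predict the following 4 tokens, one head each

def head_logits(weights, z):
    # weights: VOCAB x HIDDEN matrix; z: shared trunk hidden state -> logits
    return [sum(w * h for w, h in zip(row, z)) for row in weights]

def cross_entropy(logits, target):
    # numerically stable -log softmax(logits)[target]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

# n independent output heads on top of one shared trunk.
heads = [[[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(VOCAB)]
         for _ in range(N_HEADS)]

z_t = [random.gauss(0, 1) for _ in range(HIDDEN)]  # trunk output at position t
targets = [3, 1, 5, 2]                             # tokens x_{t+1} .. x_{t+4}

# Multi-token training loss: sum over heads of the cross-entropy
# between head i's prediction and the (i+1)-th future token.
loss = sum(cross_entropy(head_logits(heads[i], z_t), targets[i])
           for i in range(N_HEADS))
print(loss)
```

At inference, only the next-token head need be kept, so the deployed model matches an ordinary next-token predictor; the extra heads can alternatively be exploited for faster speculative-style decoding, as the abstract notes.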