We systematically investigate multi-token prediction (MTP) capabilities within LLMs pre-trained for next-token prediction (NTP). We first show that such models inherently possess MTP capabilities via numerical marginalization over intermediate token probabilities, though performance is data-dependent and improves with model scale. Furthermore, we explore the challenges of integrating MTP heads into frozen LLMs and find that their hidden layers are strongly specialized for NTP, making adaptation non-trivial. Finally, we show that while joint training of MTP heads with the backbone improves performance, it cannot fully overcome this barrier, motivating further research in this direction. Our findings provide a deeper understanding of MTP applied to pre-trained LLMs, informing strategies for accelerating inference through parallel token prediction.
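To make the marginalization referred to above concrete, here is a minimal sketch with assumed notation (not a formula quoted from the paper): an NTP model can score the token two positions ahead by summing over every candidate intermediate token $x_{t+1}$ in the vocabulary $\mathcal{V}$,
$$
p\bigl(x_{t+2} \mid x_{\le t}\bigr) \;=\; \sum_{x_{t+1} \in \mathcal{V}} p\bigl(x_{t+1} \mid x_{\le t}\bigr)\, p\bigl(x_{t+2} \mid x_{\le t},\, x_{t+1}\bigr),
$$
where the sum is evaluated numerically, for instance over a truncated set of high-probability intermediate tokens when summing over the full vocabulary is too costly.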