Traditional language models operate autoregressively, i.e., they predict one token at a time. The rapid growth in model sizes has led to high inference times. In this work, we propose DynaMo, a suite of multi-token prediction language models that reduce net inference times. Our models $\textit{dynamically}$ predict multiple tokens based on their confidence in the predicted joint probability distribution. We propose a lightweight technique to train these models that leverages the weights of their traditional autoregressive counterparts. Moreover, we propose two novel ways to enhance the estimated joint probability and thereby improve text generation quality, namely co-occurrence weighted masking and adaptive thresholding. We also propose systematic qualitative and quantitative methods to rigorously test the quality of text produced by non-autoregressive generation. One of the models in our suite, DynaMo-7.3B-T3, generates text of the same quality as the baseline (Pythia-6.9B) while achieving a 2.57$\times$ speed-up, with parameter and training-time overheads of only 5.87% and 2.67%, respectively.
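To make the idea of confidence-based dynamic prediction concrete, the following is a minimal sketch and not the DynaMo implementation. It assumes the model exposes one probability distribution per future position, approximates the joint probability as the product of the per-position top probabilities (an independence simplification), and uses a fixed threshold. The function name `accept_tokens` and the `threshold` parameter are hypothetical, and neither co-occurrence weighted masking nor adaptive thresholding is modelled here.

```python
from typing import List, Sequence


def accept_tokens(step_probs: Sequence[Sequence[float]], threshold: float) -> List[int]:
    """Return the token ids accepted in a single forward pass.

    step_probs[i] is the (hypothetical) predicted distribution over the
    vocabulary for the i-th future position. The first token is always
    accepted, which matches ordinary autoregressive decoding. Each later
    token is accepted only while the running joint-probability estimate
    stays at or above `threshold`.
    """
    accepted: List[int] = []
    joint = 1.0  # running product of the chosen tokens' probabilities
    for i, probs in enumerate(step_probs):
        # Greedy choice: the most probable token at this position.
        token = max(range(len(probs)), key=probs.__getitem__)
        joint *= probs[token]
        if i > 0 and joint < threshold:
            break  # confidence too low: stop and fall back to the next pass
        accepted.append(token)
    return accepted


if __name__ == "__main__":
    # Three future positions over a toy two-token vocabulary.
    probs = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
    print(accept_tokens(probs, threshold=0.5))
```

With a higher threshold the sketch degrades to one token per pass, and with a threshold of zero it always emits every predicted token, so in this simplified picture the threshold trades speed against the risk of accepting low-confidence tokens.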