Data scarcity drives the need for more sample-efficient large language models. In this work, we use the double descent phenomenon to holistically compare the sample efficiency of discrete diffusion and autoregressive models. We show that discrete diffusion models require larger capacity and more training epochs to escape their underparameterized regime and reach the interpolation threshold. In the strongly overparameterized regime, both models behave similarly, and neither shows a pronounced second descent in test loss across a wide range of model sizes. Overall, our results indicate that autoregressive models are more sample-efficient on small-scale datasets, while discrete diffusion models only become competitive when given sufficient capacity and compute.