This paper combines autoregressive and masked-diffusion training objectives without any architectural modifications, yielding flexible language models that outperform single-objective models. Autoregressive modeling has been a popular approach, partly because of its training efficiency; however, this efficiency comes at the cost of greater sensitivity to overfitting. Masked-diffusion models, by contrast, are less efficient to train but more resilient to overfitting. In this work, we demonstrate that dual-objective training achieves the best of both worlds. To determine the optimal balance between the two objectives, we train and evaluate 50 language models under varying levels of data repetition. We show that combining both objectives is optimal under all evaluated settings, and that the optimal balance is similar whether targeting autoregressive or masked-diffusion downstream performance.
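As a concrete illustration of the dual-objective recipe, the following is a minimal PyTorch sketch of one training step that mixes a next-token (autoregressive) cross-entropy with a masked-diffusion-style denoising loss. The model callable, the MASK_ID token, the per-sequence noise level, and the mixing weight LAM are illustrative assumptions, not the paper's implementation; attention-mask handling (causal for the autoregressive pass, bidirectional for the denoising pass) is likewise omitted for brevity.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical vocabulary id of the [MASK] token
LAM = 0.5    # hypothetical mixing weight between the two objectives

def dual_objective_loss(model, input_ids):
    """One training step combining an autoregressive and a
    masked-diffusion-style objective on the same model."""
    # --- Autoregressive objective: predict token t+1 from tokens <= t ---
    ar_logits = model(input_ids[:, :-1])
    ar_loss = F.cross_entropy(
        ar_logits.reshape(-1, ar_logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )

    # --- Masked-diffusion objective: corrupt a random fraction of tokens
    # with [MASK] and predict the originals at the masked positions ---
    t = torch.rand(input_ids.size(0), 1, device=input_ids.device)  # noise level per sequence
    mask = torch.rand_like(input_ids, dtype=torch.float) < t
    corrupted = torch.where(mask, torch.full_like(input_ids, MASK_ID), input_ids)
    md_logits = model(corrupted)
    md_loss = F.cross_entropy(md_logits[mask], input_ids[mask])

    # Convex combination of the two losses; sweeping this balance under
    # varying data repetition is the trade-off the paper studies.
    return LAM * ar_loss + (1.0 - LAM) * md_loss
```

Sweeping LAM across training runs with different data-repetition levels corresponds to the search over objective balances described above.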