In this work, we introduce a semiparametric token-sequence co-supervision training method. It trains a language model by simultaneously leveraging two supervision signals: the traditional next token prediction loss, computed over the parametric token embedding space, and a next sequence prediction loss, computed over the nonparametric sequence embedding space. The nonparametric sequence embedding space is constructed by a separate language model tasked with condensing an input text into a single representative embedding. Our experiments demonstrate that a model trained with both forms of supervision consistently surpasses models trained with either one alone. Analysis suggests that this co-supervision encourages broader generalization across the model. In particular, the robustness of the parametric token space, which is established during pretraining, tends to enhance the stability of the nonparametric sequence embedding space, a new space established by another language model.
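
To make the co-supervision objective concrete, the following is a minimal sketch of how the two losses might be combined for a single training position. The abstract does not specify the exact formulation, so several details here are assumptions: the next sequence prediction loss is written as a softmax cross-entropy over inner-product scores between the model's hidden state and candidate sequence embeddings, the two losses are summed with a hypothetical weight `nsp_weight`, and all function names are illustrative.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def next_token_loss(token_logits, target_token):
    """Cross-entropy over the parametric token space (vocabulary logits)."""
    return -log_softmax(token_logits)[target_token]


def next_sequence_loss(hidden, candidate_embeddings, target_index):
    """Cross-entropy over the nonparametric sequence space.

    Assumption: each candidate sequence was condensed into one embedding
    by a separate encoder model, and candidates are scored by inner
    product with the language model's hidden state.
    """
    scores = [dot(hidden, e) for e in candidate_embeddings]
    return -log_softmax(scores)[target_index]


def co_supervision_loss(token_logits, target_token, hidden,
                        candidate_embeddings, target_index, nsp_weight=1.0):
    """Sum of both supervision signals; nsp_weight is a hypothetical knob."""
    ntp = next_token_loss(token_logits, target_token)
    nsp = next_sequence_loss(hidden, candidate_embeddings, target_index)
    return ntp + nsp_weight * nsp
```

In this sketch, gradients from both terms would flow into the same language model, while the candidate embeddings come from the separate encoder; whether that encoder is trained jointly or kept frozen is not stated in the abstract.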