Past vocabulary learning techniques identify relevant vocabulary before training, relying on statistical and entropy-based assumptions that largely neglect the role of model training. Empirically, we observe that trained translation models are induced to use a byte-pair encoding (BPE) vocabulary subset distinct from the original BPE vocabulary, and that retraining with this induced vocabulary improves performance. In this paper, we analyze this discrepancy in neural machine translation by examining vocabulary and entropy shifts during self-training, where each iteration generates a labeled dataset by pairing source sentences with the model's predictions to define a new vocabulary. Building on these insights, we propose self-vocabularizing training, an iterative method that self-selects a smaller, more optimal vocabulary, yielding up to a 1.49 BLEU improvement. Moreover, we find that deeper model architectures lead to both an increase in unique token usage and a 6-8% reduction in vocabulary size.
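To make the procedure concrete, the sketch below outlines one plausible reading of the self-vocabularizing training loop described above: train on the original data, then repeatedly pair source sentences with the model's own predictions, re-learn a BPE vocabulary from that pseudo-labeled corpus, and retrain. The helper names (`learn_bpe`, `train_model`) are hypothetical placeholders, not the authors' actual implementation.

```python
# Minimal sketch of self-vocabularizing training, under the assumption that
# `learn_bpe` and `train_model` are supplied by the user; they are illustrative
# placeholders, not part of the paper's released code.

from typing import Callable, List, Set, Tuple

TrainFn = Callable[[List[Tuple[str, str]], Set[str]], Callable[[str], str]]
BPEFn = Callable[[List[str]], Set[str]]


def self_vocabularizing_training(
    source_sents: List[str],
    target_sents: List[str],
    learn_bpe: BPEFn,        # learns a BPE vocabulary from raw text
    train_model: TrainFn,    # trains an NMT model given parallel data + vocabulary
    n_iterations: int = 3,
):
    """Iteratively retrain a translation model on its own predictions,
    re-inducing the BPE vocabulary from the pseudo-labeled data each round."""
    # Round 0: learn the original vocabulary and train on the original parallel data.
    vocab = learn_bpe(source_sents + target_sents)
    model = train_model(list(zip(source_sents, target_sents)), vocab)

    for _ in range(n_iterations):
        # Self-training step: pair each source sentence with the model's prediction.
        pseudo_targets = [model(src) for src in source_sents]
        pseudo_data = list(zip(source_sents, pseudo_targets))

        # Induce a new (typically smaller) vocabulary from the pseudo-labeled corpus.
        vocab = learn_bpe(source_sents + pseudo_targets)

        # Retrain with the induced vocabulary.
        model = train_model(pseudo_data, vocab)

    return model, vocab
```

In this reading, the vocabulary shrinks across iterations because the model's predictions reuse a narrower set of subword tokens than the reference targets, so each re-learned BPE vocabulary covers only the tokens the model actually produces.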