It is now common practice to use random initialization schemes, rather than pre-trained embeddings, when training transformer-based models from scratch. Indeed, we find that pre-trained word embeddings from GloVe, as well as sub-word embeddings extracted from language models such as T5 and mT5, fare much worse than random initialization. This is counterintuitive given the well-known representational and transfer-learning advantages of pre-training. Interestingly, we also find that BERT and mBERT embeddings fare better than random initialization, showing the advantages of pre-trained representations. In this work, we posit two potential factors that contribute to these mixed results: the model's sensitivity to parameter distributions and the interaction of embeddings with position encodings. We observe that pre-trained GloVe, T5, and mT5 embeddings have a wider distribution of values than standard initialization schemes prescribe. As argued in prior initialization studies, such large-valued initializations can lead to poor training because of saturated outputs. Further, the larger embedding values can, in effect, absorb the smaller position encoding values when the two are added together, thereby losing position information. Standardizing the pre-trained embeddings to a narrow range (e.g., the one prescribed by Xavier initialization) leads to substantial gains for GloVe, T5, and mT5 embeddings. On the other hand, BERT's pre-trained embeddings, while larger, remain relatively close to the Xavier initialization range, which may allow them to transfer pre-trained knowledge effectively.
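To make the standardization step concrete, the sketch below rescales a pre-trained embedding matrix to the Xavier (Glorot) scale before using it to initialize a transformer. The function name and the exact recipe (z-normalize the whole matrix, then rescale to the Xavier standard deviation, treating the vocab-by-dimension matrix like a linear layer's weight) are our illustrative assumptions, not a verbatim procedure from the source.

```python
import numpy as np

def standardize_to_xavier(embeddings: np.ndarray) -> np.ndarray:
    """Rescale pre-trained embeddings to the Xavier (Glorot) range.

    Assumption: the embedding matrix (vocab_size x dim) is treated like a
    linear layer's weight when computing fan_in/fan_out for the Xavier scale.
    """
    fan_in, fan_out = embeddings.shape
    # Std of the Xavier-uniform distribution U(-a, a), a = sqrt(6/(fan_in+fan_out)).
    xavier_std = np.sqrt(2.0 / (fan_in + fan_out))
    # Z-normalize the whole matrix, then rescale to the Xavier std so the
    # values no longer swamp position encodings (which lie roughly in [-1, 1]).
    centered = embeddings - embeddings.mean()
    return centered / embeddings.std() * xavier_std

# Stand-in for loaded GloVe vectors (real vectors would come from a loader).
glove = np.random.randn(50_000, 300) * 0.4
glove_std = standardize_to_xavier(glove)
print(glove.std(), glove_std.std())  # wide pre-trained spread -> narrow Xavier-scale spread
```

After this rescaling, the standardized matrix can replace the randomly initialized embedding table; since the transform is affine over the whole matrix, relative geometry between pre-trained vectors is preserved while their magnitudes match what the rest of the network expects at initialization.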