One challenge in finetuning pretrained language models (PLMs) is that their tokenizers are optimized for the language(s) seen during pretraining but are brittle when faced with previously unseen variation in the data. This can be observed, for instance, when PLMs are finetuned on one language and evaluated on data in a closely related language variety with no standardized orthography. Despite the high linguistic similarity, the tokenization no longer yields meaningful representations of the target data, leading to low performance in tasks such as part-of-speech tagging. In this work, we finetune PLMs on seven languages from three different families and analyze their zero-shot performance on closely related, non-standardized varieties. We consider different measures of the divergence between the tokenization of the source and target data, and how these measures can be adjusted by manipulating the tokenization during finetuning. Overall, we find that the similarity between the percentages of words that get split into subwords in the source and target data (the split word ratio difference) is the strongest predictor of model performance on target data.
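To make the split word ratio difference concrete, the following minimal Python sketch computes it for two word lists. It rests on several assumptions that the abstract does not spell out: that the difference is taken as an absolute value, that a word counts as "split" when the tokenizer returns more than one piece for it, and that ratios are computed over word tokens rather than word types. The function names, the greedy toy tokenizer, and the example words are hypothetical and only stand in for a real PLM subword tokenizer.

```python
from typing import Callable, Iterable, List, Set


def make_toy_tokenizer(vocab: Set[str]) -> Callable[[str], List[str]]:
    """Build a hypothetical greedy longest-match subword tokenizer.

    This is an illustrative stand-in for a PLM tokenizer, not a real one.
    Characters that are not covered by the vocabulary become single pieces.
    """
    def tokenize(word: str) -> List[str]:
        pieces: List[str] = []
        i = 0
        while i < len(word):
            # Try the longest remaining substring first, then shorter ones.
            for j in range(len(word), i, -1):
                if word[i:j] in vocab:
                    pieces.append(word[i:j])
                    i = j
                    break
            else:
                # No vocabulary entry matched: fall back to one character.
                pieces.append(word[i])
                i += 1
        return pieces

    return tokenize


def split_word_ratio(words: Iterable[str],
                     tokenize: Callable[[str], List[str]]) -> float:
    """Fraction of words that the tokenizer splits into more than one piece."""
    words = list(words)
    if not words:
        raise ValueError("need at least one word")
    n_split = sum(1 for w in words if len(tokenize(w)) > 1)
    return n_split / len(words)


def split_word_ratio_difference(source_words: Iterable[str],
                                target_words: Iterable[str],
                                tokenize: Callable[[str], List[str]]) -> float:
    """Absolute difference between source and target split word ratios.

    Taking the absolute value is an assumption made for this sketch.
    """
    return abs(split_word_ratio(source_words, tokenize)
               - split_word_ratio(target_words, tokenize))


# Toy data: a "standard" source sentence and a non-standardized spelling variant.
tokenize = make_toy_tokenizer({"the", "house", "is", "small"})
source_words = ["the", "house", "is", "small"]
target_words = ["the", "hoose", "is", "sma"]

print(split_word_ratio(source_words, tokenize))   # 0.0
print(split_word_ratio(target_words, tokenize))   # 0.5
print(split_word_ratio_difference(source_words, target_words, tokenize))  # 0.5
```

In this toy setting, no source word is split, while the two unfamiliar spellings in the target fall apart into single characters, which gives a difference of 0.5. With a real PLM, `tokenize` would be replaced by a call to the model's own subword tokenizer.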