Pre-trained language models (PLMs) have shown remarkable success in acquiring a wide range of linguistic knowledge, relying solely on self-supervised training on text streams. Nevertheless, the effectiveness of this language-agnostic approach has frequently been questioned because of its sub-optimal performance on morphologically rich languages (MRLs). We investigate the hypothesis that incorporating explicit morphological knowledge in the pre-training phase can improve the performance of PLMs for MRLs. We propose several morphologically driven tokenization methods that enable the model to leverage morphological cues beyond raw text. We pre-train multiple language models using these methods and evaluate them on Hebrew, a language with complex and highly ambiguous morphology. Our experiments show that morphologically driven tokenization yields better results than standard language-agnostic tokenization on a benchmark of both semantic and morphological tasks. These findings suggest that incorporating morphological knowledge holds the potential to further improve PLMs for morphologically rich languages.