Effectively normalizing textual data poses a considerable challenge, especially for low-resource languages lacking standardized writing systems. In this study, we fine-tuned a multilingual model with data from several Occitan dialects and conducted a series of experiments to assess the model's representations of these dialects. For evaluation purposes, we compiled a parallel lexicon encompassing four Occitan dialects. Intrinsic evaluations of the model's embeddings revealed that surface similarity between the dialects strengthened representations. When the model was further fine-tuned for part-of-speech tagging and Universal Dependency parsing, its performance was robust to dialectal variation, even when trained solely on part-of-speech data from a single dialect. Our findings suggest that large multilingual models minimize the need for spelling normalization during pre-processing.