Orthographic variation is very common in Luxembourgish texts because the language lacks a fully fledged standard variety. Developing NLP tools for Luxembourgish is further complicated by the scarcity of annotated and parallel data, a problem exacerbated by ongoing standardization. In this paper, we propose the first sequence-to-sequence normalization models for Luxembourgish, built on the ByT5 and mT5 architectures and trained on data derived from real-life word-level variation data. We perform a fine-grained, linguistically motivated evaluation to test byte-based, word-based and pipeline-based models for their strengths and weaknesses in text normalization. We show that our sequence model using real-life variation data is an effective approach for tailor-made normalization in Luxembourgish.
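To illustrate how word-level variation data could be turned into sequence-level training pairs, the following is a minimal sketch rather than the procedure used in this work. It assumes a lexicon that maps each standard form to its attested variants, and it builds a (noisy input, normalized target) pair by substituting variants into a standard sentence. The lexicon entries, the function name `make_pair`, and the substitution probability `p` are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical toy lexicon: standard form -> observed spelling variants.
# The entries are illustrative only and are not taken from the paper's data.
VARIANTS = {
    "hunn": ["hun"],
    "Zäit": ["Zait", "Zeit"],
}


def make_pair(standard_sentence, variants, rng, p=1.0):
    """Build one (noisy, standard) training pair.

    Each whitespace-separated token that has known variants is replaced
    by a randomly chosen variant with probability p. The untouched
    standard sentence serves as the normalization target.
    """
    noisy_tokens = []
    for token in standard_sentence.split():
        options = variants.get(token)
        if options and rng.random() < p:
            noisy_tokens.append(rng.choice(options))
        else:
            noisy_tokens.append(token)
    return " ".join(noisy_tokens), standard_sentence


# Usage: a seeded generator keeps the sampled pair reproducible.
rng = random.Random(0)
source, target = make_pair("Ech hunn haut keng Zäit", VARIANTS, rng)
print(source, "->", target)
```

Pairs of this shape could then be fed to a sequence-to-sequence model, with the noisy sentence as input and the standard sentence as the expected output. Lowering `p` leaves more tokens in their standard form, so the model also sees input that needs no change.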