Orthographic variation is very common in Luxembourgish texts because the language lacks a fully fledged standard variety. Developing NLP tools for Luxembourgish is further complicated by the scarcity of annotated and parallel data, a problem exacerbated by ongoing standardization. In this paper, we propose the first sequence-to-sequence normalization models for Luxembourgish, built on the ByT5 and mT5 architectures and trained on data derived from real-life, word-level variation. We perform a fine-grained, linguistically motivated evaluation of byte-based, word-based, and pipeline-based models to identify their strengths and weaknesses in text normalization. We show that our sequence-to-sequence model trained on real-life variation data is an effective approach for tailor-made normalization in Luxembourgish.