Text normalization is a crucial technology for low-resource languages that lack rigid spelling conventions or have undergone multiple spelling reforms. Low-resource text normalization has so far relied on hand-crafted rules, which are perceived to be more data-efficient than neural methods. In this paper, we examine the case of text normalization for Ligurian, an endangered Romance language. We collect 4,394 Ligurian sentences paired with their normalized versions, as well as the first open-source monolingual corpus for Ligurian. We show that, despite the small amount of data available, a compact transformer-based model can be trained to achieve very low error rates through backtranslation and appropriate tokenization.