This paper discusses two approaches to the diachronic normalization of Polish texts: a rule-based solution that relies on a set of handcrafted patterns, and a neural normalization model based on the Text-to-Text Transfer Transformer (T5) architecture. The training and evaluation data prepared for the task are described in detail, along with the experiments conducted to compare the two solutions. Both quantitative and qualitative analyses are presented. The results show that, at the current stage of research on the problem, the rule-based solution outperforms the neural one on three of the four variants of the prepared dataset, although in practice each approach has distinct advantages and disadvantages.