Digital technologies have led to an influx of text created daily in a variety of languages, styles, and formats. Much of the popularity of spell-checking systems can be attributed to this phenomenon, since they are crucial to polishing digitally produced text. In this study, we tackle Typographical Error Type Detection in Persian, which has been relatively understudied. We present a public dataset named FarsTypo, containing 3.4 million chronologically ordered and part-of-speech tagged words covering diverse topics and linguistic styles. We develop an algorithm that injects Persian-specific errors into a scalable subset of these words, forming a parallel dataset of correct and incorrect words. Using FarsTypo, we establish a firm baseline and compare different methodologies using various architectures. In addition, we present a novel Many-to-Many Deep Sequential Neural Network that performs token classification, combining word and character embeddings with bidirectional LSTM layers to detect typographical errors across 51 classes. We compare our approach with highly advanced industrial systems that, unlike this study, were developed using a variety of resources. Our final method is competitive, achieving an accuracy of 97.62%, a precision of 98.83%, and a recall of 98.61%, and it outperforms the other systems in terms of speed.
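To make the idea of a parallel dataset with per-token error labels concrete, the following is a minimal sketch of how errors could be injected into clean tokens. It is not the FarsTypo algorithm: the three error types (`DELETE`, `TRANSPOSE`, `SUBSTITUTE`), the `corrupt` function, its `error_rate` and `seed` parameters, and the `PERSIAN_LETTERS` alphabet are all assumptions chosen for illustration, and the 51 classes used in this work are not reproduced here.

```python
import random

# Assumption: a plain 32-letter Persian alphabet used only for substitutions.
PERSIAN_LETTERS = "ابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی"


def delete_char(word, rng):
    """Drop one character; not applicable to single-character words."""
    if len(word) < 2:
        return None
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]


def transpose_chars(word, rng):
    """Swap two adjacent characters that differ, so the word really changes."""
    positions = [i for i in range(len(word) - 1) if word[i] != word[i + 1]]
    if not positions:
        return None
    i = rng.choice(positions)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def substitute_char(word, rng):
    """Replace one character with a different Persian letter."""
    if not word:
        return None
    i = rng.randrange(len(word))
    replacement = rng.choice([c for c in PERSIAN_LETTERS if c != word[i]])
    return word[:i] + replacement + word[i + 1:]


# Hypothetical error classes; the real label set is larger and Persian-specific.
ERROR_TYPES = {
    "DELETE": delete_char,
    "TRANSPOSE": transpose_chars,
    "SUBSTITUTE": substitute_char,
}


def corrupt(tokens, error_rate=0.3, seed=0):
    """Return (noisy_tokens, labels), one label per input token."""
    rng = random.Random(seed)
    noisy, labels = [], []
    for token in tokens:
        label, output = "CORRECT", token
        if rng.random() < error_rate:
            name = rng.choice(sorted(ERROR_TYPES))
            candidate = ERROR_TYPES[name](token, rng)
            if candidate is not None:  # keep the token clean if the error does not apply
                label, output = name, candidate
        noisy.append(output)
        labels.append(label)
    return noisy, labels


if __name__ == "__main__":
    clean = ["کتاب", "خوب", "است"]
    noisy, labels = corrupt(clean, error_rate=1.0, seed=1)
    for original, typo, label in zip(clean, noisy, labels):
        print(original, typo, label)
```

Because every input token receives exactly one label, the output pairs naturally with a many-to-many token classifier: the noisy tokens are the input sequence and the labels are the target sequence of the same length.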