Spelling correction is a long-standing challenge in natural language processing. Its objective is to recognize and rectify spelling errors automatically. Applications that can effectively diagnose and correct Persian spelling and grammatical errors have become increasingly important for improving the quality of Persian text, yet typographical error type detection in Persian remains relatively understudied. This paper therefore presents an approach for detecting typographical errors in Persian texts. We release a publicly available dataset called FarsTypo, which comprises 3.4 million words arranged in chronological order and tagged with their corresponding part-of-speech. These words cover a wide range of topics and linguistic styles. We develop an algorithm that applies Persian-specific errors to a scalable portion of these words, resulting in a parallel dataset of correct and incorrect words. Using FarsTypo, we establish a strong foundation and conduct a thorough comparison of various methodologies employing different architectures. Additionally, we introduce a Deep Sequential Neural Network that uses both word and character embeddings, along with bidirectional LSTM layers, for token classification aimed at detecting typographical errors across 51 distinct classes. We compare our approach with highly advanced industrial systems that, unlike this study, have been developed using a diverse range of resources. Our final method proved highly competitive, achieving an accuracy of 97.62%, precision of 98.83%, and recall of 98.61%, while surpassing the other systems in speed.
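The error-injection step described above (applying errors to a configurable portion of clean words to obtain a parallel, labeled dataset) could be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' algorithm: the three error types, the label names, the `HOMOPHONES` table, and the `corrupt` function are all hypothetical, and the abstract does not enumerate the 51 classes actually used.

```python
import random

# Hypothetical table of Persian letters that share a pronunciation.
# Illustrative only; not taken from the paper.
HOMOPHONES = {"ز": "ذضظ", "س": "صث", "ت": "ط", "ه": "ح"}


def transpose(word, rng):
    """Swap two adjacent characters; None if no effective swap is possible."""
    if len(word) < 2:
        return None
    i = rng.randrange(len(word) - 1)
    if word[i] == word[i + 1]:
        return None  # swapping identical letters would not change the word
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def delete(word, rng):
    """Drop one character; None for single-character words."""
    if len(word) < 2:
        return None
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]


def homophone_swap(word, rng):
    """Replace one letter with a same-sounding letter, if the word has one."""
    positions = [i for i, ch in enumerate(word) if ch in HOMOPHONES]
    if not positions:
        return None
    i = rng.choice(positions)
    return word[:i] + rng.choice(HOMOPHONES[word[i]]) + word[i + 1:]


# Hypothetical label set; the paper's real classes are not given in the abstract.
ERRORS = {"TRANSPOSE": transpose, "DELETE": delete, "HOMOPHONE": homophone_swap}


def corrupt(tokens, rate, seed=0):
    """Return (clean, noisy, label) triples, corrupting about `rate` of tokens."""
    rng = random.Random(seed)
    rows = []
    for tok in tokens:
        noisy, label = tok, "CORRECT"
        if rng.random() < rate:
            name = rng.choice(sorted(ERRORS))
            candidate = ERRORS[name](tok, rng)
            # Keep the token clean if the chosen error does not apply to it.
            if candidate is not None and candidate != tok:
                noisy, label = candidate, name
        rows.append((tok, noisy, label))
    return rows
```

Calling `corrupt(tokens, rate=0.1)` would yield triples in which roughly a tenth of the tokens carry an error label. The `rate` parameter plays the role of the "scalable portion" mentioned above, and the labels are the kind of per-token targets a token classifier would be trained on.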