Persian Typographical Error Type Detection Using Deep Neural Networks on Algorithmically-Generated Misspellings

Spelling correction, the automatic recognition and rectification of spelling errors, is a core challenge in natural language processing. Applications that can effectively diagnose and correct Persian spelling and grammatical errors have become increasingly important for improving the quality of Persian text, yet typographical error type detection in Persian remains relatively understudied. This paper therefore presents an approach for detecting typographical errors in Persian texts. We release FarsTypo, a publicly available dataset of 3.4 million words arranged in chronological order and tagged with their corresponding part-of-speech; the words cover a wide range of topics and linguistic styles. We develop an algorithm that applies Persian-specific errors to a scalable portion of these words, yielding a parallel dataset of correct and incorrect words. Using FarsTypo, we establish a strong foundation and conduct a thorough comparison of methods built on different architectures. We also introduce a Deep Sequential Neural Network that combines word and character embeddings with bidirectional LSTM layers to perform token classification, detecting typographical errors across 51 distinct classes. We compare our approach with highly advanced industrial systems that, unlike this study, were developed using a diverse range of resources. Our final method proved highly competitive, achieving an accuracy of 97.62%, a precision of 98.83%, and a recall of 98.61%, while surpassing the other systems in speed.
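To make the idea of algorithmically generated misspellings concrete, the following is a minimal sketch of how errors might be injected into a scalable portion of a word list to produce a parallel dataset. It is not the authors' algorithm: the four error types, the `ALPHABET` string, and the function names `apply_error` and `corrupt` are assumptions chosen for illustration, and the paper's 51 Persian-specific classes are not enumerated in the abstract.

```python
import random

# Hypothetical error types (assumption); the paper's 51 classes are not listed here.
ERROR_TYPES = ("deletion", "insertion", "substitution", "transposition")

# Illustrative set of Persian letters (assumption) used for inserted/substituted characters.
ALPHABET = "ابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی"


def apply_error(word, error_type, rng):
    """Return a copy of `word` with one error of the given type applied."""
    if len(word) < 2:
        return word  # too short to corrupt in this sketch
    i = rng.randrange(len(word) - 1)  # position that always has a right-hand neighbour
    if error_type == "deletion":
        return word[:i] + word[i + 1:]
    if error_type == "insertion":
        return word[:i] + rng.choice(ALPHABET) + word[i:]
    if error_type == "substitution":
        candidates = [c for c in ALPHABET if c != word[i]]
        return word[:i] + rng.choice(candidates) + word[i + 1:]
    if error_type == "transposition":
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    raise ValueError(f"unknown error type: {error_type}")


def corrupt(words, rate, seed=0):
    """Build (correct, noisy, label) triples, corrupting roughly `rate` of the words."""
    rng = random.Random(seed)
    triples = []
    for word in words:
        if len(word) >= 2 and rng.random() < rate:
            label = rng.choice(ERROR_TYPES)
            noisy = apply_error(word, label, rng)
            if noisy == word:  # e.g. swapping two identical letters changes nothing
                label = "correct"
            triples.append((word, noisy, label))
        else:
            triples.append((word, word, "correct"))
    return triples


if __name__ == "__main__":
    for correct, noisy, label in corrupt(["کتاب", "مدرسه", "من"], rate=0.7, seed=1):
        print(correct, noisy, label)
```

Each triple pairs a correct word with its possibly corrupted form and an error-type label, which is the shape of input a token classifier over error classes would need. The `rate` parameter plays the role of the scalable portion of words that receive an error.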

