Bangla is mostly typed on an English keyboard, and the result can be highly error-prone because of compound letters and similarly pronounced letters. Correcting a misspelled word requires an understanding of the word's typing pattern as well as the context in which the word is used. This paper proposes BSpell, a specialized BERT model for word-for-word correction at the sentence level. BSpell contains an end-to-end trainable CNN sub-model named SemanticNet, along with a specialized auxiliary loss. This allows BSpell to specialize in the highly inflected Bangla vocabulary in the presence of spelling errors. Furthermore, a hybrid pretraining scheme is proposed for BSpell that combines word-level and character-level masking. Comparisons on two Bangla spelling correction datasets and one Hindi spelling correction dataset show the superiority of our proposed approach. BSpell is available as a Bangla spell checking tool via GitHub: https://github.com/Hasiburshanto/Bangla-Spell-Checker
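To make the idea of hybrid masking concrete, the following is a minimal sketch of one way word-level and character-level masking could be combined when preparing pretraining examples. It is an illustration under stated assumptions, not the procedure used for BSpell: the function name `hybrid_mask`, the mask symbols, and the probabilities `p_word`, `p_char`, and `char_frac` are all hypothetical and do not come from the abstract.

```python
import random

WORD_MASK = "[MASK]"  # hypothetical word-level mask token
CHAR_MASK = "_"       # hypothetical character-level mask symbol


def hybrid_mask(words, p_word=0.15, p_char=0.15, char_frac=0.5, rng=None):
    """Return (masked_words, target_words) for one sentence.

    Assumed scheme (not taken from the paper): each word is either
    replaced entirely by WORD_MASK, or has a fraction of its characters
    replaced by CHAR_MASK, or is left unchanged. The targets are always
    the original words, matching a word-for-word correction setup.
    """
    rng = rng or random.Random()
    masked = []
    for word in words:
        r = rng.random()
        if not word:
            # Nothing to mask in an empty token.
            masked.append(word)
        elif r < p_word:
            # Word-level masking: the whole word must be recovered from context.
            masked.append(WORD_MASK)
        elif r < p_word + p_char:
            # Character-level masking: hide part of the word so it must be
            # recovered from the remaining characters plus context.
            n = max(1, int(len(word) * char_frac))
            positions = set(rng.sample(range(len(word)), n))
            masked.append(
                "".join(CHAR_MASK if i in positions else ch
                        for i, ch in enumerate(word))
            )
        else:
            masked.append(word)
    return masked, list(words)


if __name__ == "__main__":
    # Romanized placeholder tokens; real Bangla text would likely need
    # grapheme-aware splitting rather than per-code-point indexing.
    sentence = ["ami", "bangla", "likhte", "pari"]
    print(hybrid_mask(sentence, rng=random.Random(0)))
```

In this sketch a single random draw decides which kind of masking a word receives, so the two masking types never apply to the same word. That is a design choice made here for simplicity.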