Tashkeel, or Arabic Text Diacritization (ATD), greatly improves the comprehension of Arabic text by removing ambiguity and reducing the risk of misinterpretation that arises when diacritics are absent. It plays a crucial role in Arabic text processing, particularly in applications such as text-to-speech and machine translation. This paper introduces a new approach to training ATD models. First, we fine-tuned two transformers, one encoder-only and one encoder-decoder, both initialized from a pretrained character-based BERT. Then, we applied the Noisy-Student approach to boost the performance of the best model. We evaluated our models alongside 11 commercial and open-source models on two manually labeled benchmark datasets: WikiNews and our CATT dataset. Our top model surpasses all evaluated models, with relative Diacritic Error Rate (DER) reductions of 30.83\% and 35.21\% on WikiNews and CATT, respectively, achieving state-of-the-art results in ATD. In addition, our model outperforms GPT-4-turbo on the CATT dataset, with a relative DER reduction of 9.36\%. We open-source our CATT models and benchmark dataset for the research community\footnote{https://github.com/abjadai/catt}.
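To make the reported metric concrete, the following is a minimal sketch of how a Diacritic Error Rate and a relative DER reduction could be computed. It assumes that the diacritics have already been aligned per character into equal-length label sequences, and that every character position counts toward the denominator; the exact evaluation protocol used in the paper (for example, how undiacritized characters or case endings are handled) is not stated in this abstract, so those choices are assumptions. The function names `diacritic_error_rate` and `relative_der_reduction` are illustrative and are not taken from the CATT repository.

```python
from typing import Sequence


def diacritic_error_rate(reference: Sequence[str], predicted: Sequence[str]) -> float:
    """Percentage of character positions whose predicted diacritic label
    differs from the reference label.

    Assumption: both sequences hold one label per character (an empty
    string meaning "no diacritic") and are already aligned.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    if not reference:
        raise ValueError("sequences must not be empty")
    errors = sum(r != p for r, p in zip(reference, predicted))
    return 100.0 * errors / len(reference)


def relative_der_reduction(baseline_der: float, model_der: float) -> float:
    """Relative improvement of model_der over baseline_der, in percent."""
    if baseline_der <= 0:
        raise ValueError("baseline_der must be positive")
    return 100.0 * (baseline_der - model_der) / baseline_der


if __name__ == "__main__":
    # Toy labels for illustration only; these are not results from the paper.
    reference = ["a", "u", "", "i"]
    predicted = ["a", "i", "", "i"]
    print(diacritic_error_rate(reference, predicted))
    print(relative_der_reduction(baseline_der=4.0, model_der=3.0))
```

Under this reading, a relative figure compares two absolute DERs: it expresses how much of the baseline's error the better model removes, rather than the difference in percentage points.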