Tashkeel, or Arabic Text Diacritization (ATD), greatly enhances the comprehension of Arabic text by resolving ambiguity and minimizing the risk of misinterpretation caused by the absence of diacritics. It plays a crucial role in improving Arabic text processing, particularly in applications such as text-to-speech and machine translation. This paper introduces a new approach to training ATD models. First, we fine-tuned two transformer models, an encoder-only and an encoder-decoder, both initialized from a pretrained character-based BERT. Then, we applied the Noisy-Student approach to boost the performance of the best model. We evaluated our models alongside 11 commercial and open-source models using two manually labeled benchmark datasets: WikiNews and our CATT dataset. Our findings show that our top model surpasses all evaluated models by relative Diacritic Error Rates (DERs) of 30.83\% and 35.21\% on WikiNews and CATT, respectively, achieving state-of-the-art performance in ATD. In addition, we show that our model outperforms GPT-4-turbo on the CATT dataset by a relative DER of 9.36\%. We open-source our CATT models and benchmark dataset for the research community\footnote{https://github.com/abjadai/catt}.
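The relative DER figures quoted above can be sketched as follows. This is a minimal illustration of the standard relative-error-reduction formula, not the paper's evaluation code; the error counts and helper names below are made-up assumptions for demonstration only.

```python
# Hypothetical sketch of how a relative Diacritic Error Rate (DER)
# improvement is typically computed. The counts below are invented;
# only the formula (baseline - new) / baseline reflects standard usage.

def der(errors: int, total_diacritics: int) -> float:
    """Diacritic Error Rate: fraction of diacritic positions predicted incorrectly."""
    return errors / total_diacritics

def relative_der_reduction(baseline_der: float, new_der: float) -> float:
    """Relative improvement of new_der over baseline_der, in percent."""
    return (baseline_der - new_der) / baseline_der * 100.0

# Invented example: a baseline at 12.0% DER vs. a new model at 8.3% DER.
baseline = der(120, 1000)
ours = der(83, 1000)
print(f"{relative_der_reduction(baseline, ours):.2f}%")  # prints "30.83%"
```

Note that a relative reduction (as reported in the abstract) differs from an absolute one: dropping from 12.0% to 8.3% DER is 3.7 absolute points but a 30.83% relative improvement.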