We present ArTST, a pre-trained Arabic text and speech transformer that supports open-source speech technologies for the Arabic language. The model architecture follows SpeechT5, the unified-modal framework recently released for English. ArTST focuses on Modern Standard Arabic (MSA), with plans to extend the model to dialectal and code-switched Arabic in future editions. We pre-trained the model from scratch on MSA speech and text data and fine-tuned it for three tasks: Automatic Speech Recognition (ASR), Text-To-Speech synthesis (TTS), and spoken dialect identification. In experiments comparing ArTST with SpeechT5 and with previously reported results on these tasks, ArTST performs on par with or exceeds the current state of the art in all three tasks. Moreover, we find that our pre-training is conducive to generalization, which is particularly evident in the low-resource TTS task. The pre-trained model and the fine-tuned ASR and TTS models are released for research use.