Crafting an effective Automatic Speech Recognition (ASR) solution for dialects demands innovative approaches that not only address data scarcity but also navigate the intricacies of linguistic diversity. In this paper, we address this challenge for the Tunisian dialect. First, textual and audio data are collected and, in some cases, annotated. Second, we explore self-supervision, semi-supervision and few-shot code-switching approaches to push the state of the art on different Tunisian test sets covering different acoustic, linguistic and prosodic conditions. Finally, given the absence of a conventional spelling, we conduct a human evaluation of our transcripts to avoid the noise coming from spelling inadequacies in our test references. Our models, which can transcribe audio samples in a linguistic mix involving Tunisian Arabic, English and French, and all the data used during training and testing are released for public use and further improvements.