With the rise of generative text-to-speech (TTS) models, distinguishing real from synthetic speech has become challenging, especially for Arabic, which has received limited research attention. Most spoof detection efforts have focused on English, leaving a significant gap for Arabic and its many dialects. In this work, we introduce the first multi-dialect Arabic spoofed speech dataset. To decide whether the final dataset should merge audio from multiple TTS models or use only the best-performing one, we evaluated how difficult each model's synthesized audio is to detect. The evaluation pipeline trained classifiers using three approaches: modern embedding-based methods combined with classifier heads, classical machine learning algorithms applied to MFCC features, and the RawNet2 architecture. The pipeline also computed the Mean Opinion Score (MOS) from human ratings and passed both the original and synthesized datasets through an Automatic Speech Recognition (ASR) model to measure the Word Error Rate (WER). Our results show that FishSpeech outperforms the other TTS models in Arabic voice cloning on the Casablanca corpus, producing more realistic and challenging synthetic speech samples. However, relying on a single TTS model for dataset creation may limit generalizability.
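As an illustration of the Word Error Rate measurement mentioned above, the following is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and an ASR hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for this example; the abstract does not specify the ASR model or the text normalization used, and Arabic text would typically need normalization (for instance, of diacritics) before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative sketch only: tokenization is plain whitespace splitting,
    which is an assumption and not the procedure described in the abstract.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, comparing the WER of the original recordings with the WER of their synthesized counterparts gives a rough indication of how intelligible each TTS model's output is to the ASR system.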