Existing Persian speech datasets are typically smaller than their English counterparts, which limits the development of Persian speech technologies. We address this gap by introducing ParsVoice, the largest Persian speech corpus designed specifically for text-to-speech (TTS) applications. We built an automated pipeline that transforms raw audiobook content into TTS-ready data; its components include a BERT-based sentence completion detector, a binary-search boundary optimization method for precise audio-text alignment, and audio-text quality assessment frameworks tailored to Persian. The pipeline processed 2,000 audiobooks, yielding 3,526 hours of clean speech, which we further filtered into a 1,804-hour high-quality subset suitable for TTS, featuring more than 470 speakers. To validate the dataset, we fine-tuned XTTS for Persian, achieving a naturalness Mean Opinion Score (MOS) of 3.6/5 and a Speaker Similarity Mean Opinion Score (SMOS) of 4.0/5, demonstrating ParsVoice's effectiveness for training multi-speaker TTS systems. ParsVoice is the largest high-quality Persian speech dataset, offering speaker diversity and audio quality comparable to major English corpora. The complete dataset is publicly available at https://huggingface.co/datasets/MohammadJRanjbar/ParsVoice to accelerate the development of Persian speech technologies.
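The abstract names a binary-search boundary optimization step but does not spell out its details. As a rough illustration only, the idea of searching for a segment boundary can be sketched as below. The function name `find_segment_end`, the predicate `covers_sentence`, the tolerance `tol`, and the toy word timings are all hypothetical and are not taken from the ParsVoice pipeline. The sketch assumes the predicate is monotone: once the audio window is long enough to contain the whole sentence, extending it further keeps the sentence covered.

```python
from typing import Callable, List, Tuple


def find_segment_end(
    covers_sentence: Callable[[float], bool],
    lo: float,
    hi: float,
    tol: float = 0.01,
) -> float:
    """Return (within tol seconds) the earliest end time in [lo, hi]
    at which covers_sentence(end_time) becomes True.

    Assumption: covers_sentence is monotone (False, ..., False, True, ..., True).
    """
    if not covers_sentence(hi):
        raise ValueError("sentence is not covered even at the upper bound")
    # Invariant: covers_sentence(hi) is True; the boundary lies in (lo, hi].
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if covers_sentence(mid):
            hi = mid  # mid is long enough, so try a shorter window
        else:
            lo = mid  # mid cuts the sentence off, so extend the window
    return hi


# Toy stand-in for an ASR or aligner check: (word, start_sec, end_sec).
# In a real pipeline this check would come from a recognizer, not fixed timings.
WORDS: List[Tuple[str, float, float]] = [
    ("salam", 0.0, 0.4),
    ("donya", 0.5, 1.1),
    ("in", 1.6, 1.8),
]
SENTENCE_WORD_COUNT = 2  # hypothetical: the target sentence is the first two words


def toy_covers_sentence(end_time: float) -> bool:
    """True if every word of the target sentence ends at or before end_time."""
    return all(end <= end_time for _, _, end in WORDS[:SENTENCE_WORD_COUNT])


end = find_segment_end(toy_covers_sentence, lo=0.0, hi=3.0, tol=0.01)
print(f"estimated segment end: {end:.3f} s")
```

Each iteration halves the search interval, so the number of predicate calls grows only logarithmically with the window length divided by the tolerance. That matters when every call involves running a speech recognizer on a candidate audio segment.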