Persian, despite being spoken by over 100 million people worldwide, remains severely underrepresented in high-quality speech corpora, particularly for text-to-speech (TTS) synthesis. Existing Persian speech datasets are typically smaller than their English counterparts, which is a key limitation for developing Persian speech technologies. We address this gap by introducing ParsVoice, the largest Persian speech corpus designed specifically for TTS applications. We built an automated pipeline that transforms raw audiobook content into TTS-ready data, incorporating a BERT-based sentence completion detector, a binary search boundary optimization method for precise audio-text alignment, and multi-dimensional quality assessment frameworks tailored to Persian. The pipeline processed 2,000 audiobooks, yielding 3,526 hours of clean speech, which was further filtered into a 1,804-hour high-quality subset suitable for TTS, featuring more than 470 speakers. ParsVoice is the largest high-quality Persian speech dataset, offering speaker diversity and audio quality comparable to major English corpora. The complete dataset is publicly available at https://huggingface.co/datasets/MohammadJRanjbar/ParsVoice to accelerate the development of Persian speech technologies and to serve as a template for other low-resource languages.
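The abstract does not spell out how the binary search boundary optimization works, so the following is only a minimal sketch of the general idea, under stated assumptions: the text is a sequence of words, and there is some monotone check that tells whether the first k words still fit a given audio clip. The names `find_boundary` and `fits`, and the toy per-word durations, are hypothetical and do not come from ParsVoice. In a real pipeline the check would presumably come from an ASR or alignment score rather than fixed durations.

```python
from typing import Callable, Sequence


def find_boundary(words: Sequence[str], fits: Callable[[int], bool]) -> int:
    """Return the largest k in [0, len(words)] for which fits(k) is True.

    Assumption: fits is monotone (True up to some cutoff, False afterwards)
    and fits(0) is True, which is what makes binary search valid here.
    """
    lo, hi = 0, len(words)
    while lo < hi:
        mid = (lo + hi + 1) // 2  # upper midpoint so the loop always progresses
        if fits(mid):
            lo = mid       # the first `mid` words still fit; try more
        else:
            hi = mid - 1   # too many words; shrink the candidate range
    return lo


# Toy data (hypothetical): six placeholder words with made-up durations in seconds.
words = ["w1", "w2", "w3", "w4", "w5", "w6"]
durations = [0.4, 0.5, 0.3, 0.6, 0.7, 0.5]
clip_seconds = 2.0


def fits(k: int) -> bool:
    # Stand-in for a real alignment check: do the first k words fit in the clip?
    return sum(durations[:k]) <= clip_seconds


boundary = find_boundary(words, fits)
print(words[:boundary])
```

With these toy durations the search settles on the first four words after three calls to `fits`, instead of checking every prefix. That saving is the reason to prefer binary search when each check is expensive, for example when it involves running a speech model.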