Recent advancements in text-to-speech (TTS) synthesis show that large-scale models trained with extensive web data produce highly natural-sounding output. However, such data is scarce for Indian languages due to the lack of high-quality, manually subtitled data on platforms like LibriVox or YouTube. To address this gap, we enhance existing large-scale ASR datasets containing natural conversations collected in low-quality environments to generate high-quality TTS training data. Our pipeline leverages the cross-lingual generalization of denoising and speech enhancement models that are trained on English and applied to Indian languages. This results in IndicVoices-R (IV-R), the largest multilingual Indian TTS dataset derived from an ASR dataset, with 1,704 hours of high-quality speech from 10,496 speakers across 22 Indian languages. IV-R matches the quality of gold-standard TTS datasets like LJSpeech, LibriTTS, and IndicTTS. We also introduce the IV-R Benchmark, the first to assess zero-shot, few-shot, and many-shot speaker generalization capabilities of TTS models on Indian voices, ensuring diversity in age, gender, and style. We demonstrate that fine-tuning an English pre-trained model on a combined dataset of high-quality IndicTTS and our IV-R dataset results in better zero-shot speaker generalization than fine-tuning on the IndicTTS dataset alone. Further, our evaluation reveals limited zero-shot generalization for Indian voices in TTS models trained on prior datasets, which we improve by fine-tuning the model on our data containing a diverse set of speakers across language families. We open-source all data and code, releasing the first TTS model for all 22 official Indian languages.