Recent advancements in Text-to-Speech (TTS) technology have led to natural-sounding speech for English, primarily due to the availability of large-scale, high-quality web data. However, many other languages lack access to such resources, relying instead on limited studio-quality data. This scarcity results in synthesized speech that often suffers from intelligibility issues, particularly with low-frequency character bigrams. In this paper, we propose three solutions to address this challenge. First, we leverage high-quality data from linguistically or geographically related languages to improve TTS for the target language. Second, we utilize low-quality Automatic Speech Recognition (ASR) data recorded in non-studio environments, which is refined using denoising and speech enhancement models. Third, we apply knowledge distillation from large-scale models using synthetic data to generate more robust outputs. Our experiments with Hindi demonstrate significant reductions in intelligibility issues, as validated by human evaluators. We propose this methodology as a viable alternative for languages with limited access to high-quality data, enabling them to collectively benefit from shared resources.