Large-scale text-to-speech (TTS) systems are limited by the scarcity of clean, multilingual recordings. We introduce Sidon, a fast, open-source speech restoration model that converts noisy in-the-wild speech into studio-quality speech and scales to dozens of languages. Sidon consists of two models: a feature predictor, finetuned from w2v-BERT 2.0, that cleanses features extracted from noisy speech, and a vocoder trained to synthesize restored speech from the cleansed features. Sidon achieves restoration performance comparable to that of Miipher, Google's internal speech restoration model designed for cleansing speech-synthesis datasets. Sidon is also computationally efficient, running up to 500 times faster than real time on a single GPU. We further show that training a TTS model on an automatic speech recognition corpus cleansed by Sidon improves the quality of synthetic speech in a zero-shot setting. Code and models are released to facilitate reproducible dataset cleansing for the research community.
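The two-stage design described above (a feature predictor that cleanses noisy features, followed by a vocoder that synthesizes the restored waveform) can be sketched as follows. This is a minimal illustrative skeleton, not the released implementation: the function names, feature dimension, and hop size are assumptions, and both stages are placeholders standing in for the finetuned w2v-BERT 2.0 predictor and the neural vocoder.

```python
import numpy as np

FEATURE_DIM = 1024   # assumed feature dimension (illustrative)
HOP_SAMPLES = 320    # assumed waveform samples per feature frame (illustrative)

def predict_clean_features(noisy_features: np.ndarray) -> np.ndarray:
    """Stage 1: feature predictor (stand-in for the finetuned w2v-BERT 2.0 model).

    Maps per-frame features of noisy speech to cleansed features.
    Here an identity placeholder; the real stage is a neural network.
    """
    return noisy_features

def vocoder(features: np.ndarray) -> np.ndarray:
    """Stage 2: vocoder stand-in that synthesizes a waveform from features.

    Returns a silent waveform of the expected length; the real stage is
    a trained neural vocoder.
    """
    num_frames = features.shape[0]
    return np.zeros(num_frames * HOP_SAMPLES, dtype=np.float32)

def restore(noisy_features: np.ndarray) -> np.ndarray:
    """Full pipeline: cleanse features, then synthesize restored speech."""
    cleansed = predict_clean_features(noisy_features)
    return vocoder(cleansed)

# 50 frames of hypothetical features -> 50 * 320 = 16000 output samples
frames = np.random.randn(50, FEATURE_DIM).astype(np.float32)
waveform = restore(frames)
print(waveform.shape)  # (16000,)
```

The point of the split is that each stage can be trained and scaled independently: the predictor handles denoising in a self-supervised feature space, while the vocoder only ever sees (cleansed) features, which is what makes dataset-scale batch restoration fast.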