Large-scale text-to-speech (TTS) systems are limited by the scarcity of clean, multilingual recordings. We introduce Sidon, a fast, open-source speech restoration model that converts noisy in-the-wild speech into studio-quality speech and scales to dozens of languages. Sidon consists of two models: a feature predictor, finetuned from w2v-BERT 2.0, that cleanses features extracted from noisy speech, and a vocoder trained to synthesize restored speech from the cleansed features. Sidon achieves restoration performance comparable to that of Miipher, Google's internal speech restoration model designed for cleansing speech-synthesis datasets. Sidon is also computationally efficient, running up to 500 times faster than real time on a single GPU. We further show that training a TTS model on an automatic speech recognition corpus cleansed by Sidon improves the quality of synthetic speech in a zero-shot setting. Code and models are released to facilitate reproducible dataset cleansing for the research community.
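The two-stage design described above (a feature predictor that cleanses noisy features, followed by a vocoder that synthesizes the restored waveform) can be sketched as follows. This is a minimal illustrative skeleton, not the released implementation: the function names, feature dimension, and hop size are assumptions, and both stages are placeholders standing in for the finetuned w2v-BERT 2.0 predictor and the neural vocoder.

```python
import numpy as np

FEATURE_DIM = 1024   # assumed feature dimension (illustrative)
HOP_SAMPLES = 320    # assumed waveform samples per feature frame (illustrative)

def predict_clean_features(noisy_features: np.ndarray) -> np.ndarray:
    """Stage 1: feature predictor (stand-in for the finetuned w2v-BERT 2.0 model).

    Maps per-frame features of noisy speech to cleansed features.
    Here an identity placeholder; the real stage is a neural network.
    """
    return noisy_features

def vocoder(features: np.ndarray) -> np.ndarray:
    """Stage 2: vocoder stand-in that synthesizes a waveform from features.

    Returns a silent waveform of the expected length; the real stage is
    a trained neural vocoder.
    """
    num_frames = features.shape[0]
    return np.zeros(num_frames * HOP_SAMPLES, dtype=np.float32)

def restore(noisy_features: np.ndarray) -> np.ndarray:
    """Full pipeline: cleanse features, then synthesize restored speech."""
    cleansed = predict_clean_features(noisy_features)
    return vocoder(cleansed)

# 50 frames of hypothetical features -> 50 * 320 = 16000 output samples
frames = np.random.randn(50, FEATURE_DIM).astype(np.float32)
waveform = restore(frames)
print(waveform.shape)  # (16000,)
```

The point of the split is that each stage can be trained and scaled independently: the predictor handles denoising in a self-supervised feature space, while the vocoder only ever sees (cleansed) features, which is what makes dataset-scale batch restoration fast.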