Generative models have attracted growing attention in recent years for their remarkable success in tasks that require estimating and sampling from data distributions to generate high-fidelity synthetic data. In speech, text-to-speech synthesis and neural vocoding are good examples where generative models have excelled. Although generative models have been applied to various applications in speech, no general-purpose generative model yet models speech directly. In this work, we take a step in this direction by showing that a single pre-trained generative model can be adapted to different downstream tasks with strong performance. Specifically, we pre-train a generative model, named SpeechFlow, on 60k hours of untranscribed speech with Flow Matching and masked conditions. Experimental results show that the pre-trained generative model can be fine-tuned with task-specific data to match or surpass existing expert models on speech enhancement, separation, and synthesis. Our work suggests that a foundational model for generation tasks in speech can be built with generative pre-training.
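To make the phrase "Flow Matching and masked conditions" more concrete, the following is a minimal sketch of how one training example could be constructed, assuming the standard optimal-transport conditional flow matching path and a simple per-frame Bernoulli mask. The function names, `sigma_min`, `mask_prob`, and the masking scheme are illustrative assumptions, not details taken from this abstract, and the network that predicts the velocity is omitted.

```python
import numpy as np


def make_masked_fm_example(x1, rng, sigma_min=1e-4, mask_prob=0.7):
    """Build one (hypothetical) flow matching training example.

    x1: array of shape (frames, dims) holding clean speech features.
    Returns the noisy point x_t, the time t, the masked condition,
    and the regression target (the velocity along the path).
    """
    x0 = rng.standard_normal(x1.shape)  # sample from the Gaussian prior
    t = rng.uniform()                   # time step drawn from [0, 1)

    # Assumed optimal-transport path between noise x0 and data x1.
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    target = x1 - (1.0 - sigma_min) * x0  # time derivative of x_t

    # Assumed masking scheme: drop whole frames of the clean features,
    # so the model sees only a partial view of x1 as its condition.
    keep = rng.uniform(size=(x1.shape[0], 1)) >= mask_prob
    cond = np.where(keep, x1, 0.0)
    return x_t, t, cond, target


def flow_matching_loss(predicted_velocity, target):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((predicted_velocity - target) ** 2))


rng = np.random.default_rng(0)
features = rng.standard_normal((50, 80))  # stand-in for 50 frames of 80-dim features
x_t, t, cond, target = make_masked_fm_example(features, rng)
```

In this sketch a model would receive `x_t`, `t`, and `cond` and be trained to minimize `flow_matching_loss` against `target`; fine-tuning for a downstream task would then swap the masked condition for a task-specific one, such as noisy speech for enhancement.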