For fine-grained generation and recognition tasks such as minimally-supervised text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), the intermediate representation extracted from speech should carry information that lies between text coding and acoustic coding. Linguistic content should be salient, while paralinguistic information such as speaker identity and acoustic details should be removed. However, existing methods for extracting fine-grained intermediate representations from speech suffer from excessive redundancy and dimension explosion. In addition, existing contrastive learning methods in the audio field focus on extracting global descriptive information for downstream audio classification, which makes them unsuitable for TTS, VC, and ASR. To address these issues, we propose Contrastive Phoneme-Speech Pretraining (CPSP), which uses three encoders, one decoder, and contrastive learning to bring phonemes and speech into a joint multimodal space and learns to connect them at the frame level. The CPSP model is trained on 210k pairs of speech and phoneme text and achieves minimally-supervised TTS, VC, and ASR. The proposed method offers a promising solution for fine-grained downstream generation and recognition tasks in speech processing. We provide a website with audio samples.
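To make the idea of frame-level contrastive alignment concrete, the following is a minimal sketch and not the CPSP implementation. The abstract does not specify the loss, so a symmetric InfoNCE objective is assumed here as a common choice. The function names, the temperature of 0.07, and the assumption that phoneme and speech embeddings are already aligned frame by frame are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def diagonal_log_softmax(logits):
    """For each row i, return the log-softmax value at column i."""
    out = []
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        lse = m + math.log(sum(math.exp(v - m) for v in row))
        out.append(row[i] - lse)
    return out


def frame_contrastive_loss(phoneme_frames, speech_frames, temperature=0.07):
    """Symmetric InfoNCE over frame-aligned embeddings (assumed objective).

    phoneme_frames[i] and speech_frames[i] are treated as the positive pair
    for frame i; every other frame in the sequence acts as a negative.
    """
    if len(phoneme_frames) != len(speech_frames):
        raise ValueError("sequences must have the same number of frames")
    p = [l2_normalize(v) for v in phoneme_frames]
    s = [l2_normalize(v) for v in speech_frames]
    # logits[i][j]: cosine similarity of phoneme frame i and speech frame j,
    # divided by the temperature.
    logits = [
        [sum(a * b for a, b in zip(pi, sj)) / temperature for sj in s]
        for pi in p
    ]
    logits_t = [list(col) for col in zip(*logits)]
    phoneme_to_speech = -sum(diagonal_log_softmax(logits)) / len(p)
    speech_to_phoneme = -sum(diagonal_log_softmax(logits_t)) / len(s)
    return 0.5 * (phoneme_to_speech + speech_to_phoneme)


# Toy usage: two orthogonal "frames", paired correctly and then swapped.
phonemes = [[1.0, 0.0], [0.0, 1.0]]
aligned_speech = [[1.0, 0.0], [0.0, 1.0]]
swapped_speech = [[0.0, 1.0], [1.0, 0.0]]
print(frame_contrastive_loss(phonemes, aligned_speech))
print(frame_contrastive_loss(phonemes, swapped_speech))
```

In this toy case the correctly paired sequences give a much lower loss than the swapped ones, which is the behavior a frame-level objective of this kind is meant to reward. In contrast to clip-level contrastive methods, the negatives here are other frames rather than other utterances.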