For fine-grained generation and recognition tasks such as minimally supervised text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), the intermediate representations extracted from speech should serve as a "bridge" between text and acoustic information, carrying information from both modalities. They should emphasize semantic content and de-emphasize paralinguistic information such as speaker identity and acoustic details. However, existing methods for extracting fine-grained intermediate representations from speech suffer from excessive redundancy and dimension explosion. Contrastive learning is well suited to modeling intermediate representations shared by two modalities, but existing contrastive learning methods in the audio field focus on extracting global descriptive information for downstream audio classification, which makes them unsuitable for TTS, VC, and ASR. To address these issues, we propose Contrastive Token-Acoustic Pretraining (CTAP), which uses two encoders to bring phonemes and speech into a joint multimodal space and learns to connect them at the frame level. The CTAP model is trained on 210k pairs of speech and phoneme text and achieves minimally supervised TTS, VC, and ASR. The proposed method offers a promising solution for fine-grained generation and recognition downstream tasks in speech processing.
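To make the idea of frame-level contrastive alignment concrete, the sketch below computes a symmetric InfoNCE-style loss between per-frame speech embeddings and per-frame phoneme embeddings. The abstract does not state the loss used by CTAP, so this is only an illustration under assumptions: the function name `frame_contrastive_loss`, the temperature of 0.07, and the premise that the phoneme sequence has already been expanded to one embedding per speech frame are all hypothetical choices, not details from the paper.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def log_softmax(row: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def frame_contrastive_loss(
    speech_frames: Sequence[Vector],
    phoneme_frames: Sequence[Vector],
    temperature: float = 0.07,  # assumed value, not taken from the paper
) -> float:
    """Symmetric InfoNCE loss over frames (hypothetical sketch).

    Frame i of the speech encoder output is treated as the positive match
    for frame i of the phoneme encoder output; all other frames in the
    batch act as negatives.
    """
    if not speech_frames or len(speech_frames) != len(phoneme_frames):
        raise ValueError("need the same non-zero number of frames in both inputs")

    s = [l2_normalize(v) for v in speech_frames]
    p = [l2_normalize(v) for v in phoneme_frames]
    n = len(s)

    # logits[i][j] = cosine similarity of speech frame i and phoneme frame j,
    # divided by the temperature.
    logits = [
        [sum(a * b for a, b in zip(si, pj)) / temperature for pj in p]
        for si in s
    ]
    logits_t = [list(col) for col in zip(*logits)]  # phoneme-to-speech view

    # Cross-entropy with the diagonal as the target, in both directions.
    loss_s2p = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    loss_p2s = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (loss_s2p + loss_p2s)


if __name__ == "__main__":
    aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    shuffled = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
    print("aligned loss: ", frame_contrastive_loss(aligned, aligned))
    print("shuffled loss:", frame_contrastive_loss(aligned, shuffled))
```

In this toy setup the loss is close to zero when each speech frame matches the phoneme frame at the same index, and it grows large when the pairing is shuffled, which is the behavior a frame-level contrastive objective is meant to reward. A real system would compute the same quantity on batched encoder outputs in a tensor library rather than on Python lists.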