Disentangling Prosody Representations with Unsupervised Speech Reconstruction

Human speech can be characterized by several components, including semantic content, speaker identity, and prosodic information. Significant progress has been made in disentangling representations of semantic content and speaker identity in Automatic Speech Recognition (ASR) and speaker verification, respectively. Extracting prosodic information, however, remains an open and challenging research question, both because different attributes such as timbre and rhythm are intrinsically entangled and because robust, large-scale, speaker-independent ASR relies on supervised training schemes. This paper addresses the disentanglement of emotional prosody from speech through unsupervised reconstruction. Specifically, we identify, design, implement, and integrate three crucial components in our proposed speech reconstruction model, Prosody2Vec: (1) a unit encoder that transforms speech signals into discrete units representing semantic content, (2) a pretrained speaker verification model that generates speaker identity embeddings, and (3) a trainable prosody encoder that learns prosody representations. We first pretrain Prosody2Vec on unlabelled emotional speech corpora, then fine-tune the model on specific datasets to perform Speech Emotion Recognition (SER) and Emotional Voice Conversion (EVC). Both objective (weighted and unweighted accuracy) and subjective (mean opinion score) evaluations on the EVC task suggest that Prosody2Vec effectively captures general prosodic features that can be smoothly transferred to other emotional speech. In addition, our SER experiments on the IEMOCAP dataset show that the prosody features learned by Prosody2Vec are complementary to widely used speech pretraining models and improve their performance, and that combining Prosody2Vec with HuBERT representations surpasses state-of-the-art methods.
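The two objective metrics named above can be illustrated with a minimal sketch. It assumes the convention common in the SER literature, where weighted accuracy (WA) is the overall fraction of correctly classified utterances and unweighted accuracy (UA) is the mean of the per-class recalls; the abstract does not spell out its exact definitions. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: fraction of all utterances classified correctly."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every emotion class counts equally."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    totals = Counter(y_true)  # utterances per true class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class
    return sum(hits[c] / n for c, n in totals.items()) / len(totals)


# Toy, made-up labels (assumption for illustration only): an imbalanced set
# in which the single "angry" utterance is misclassified as "neutral".
y_true = ["neutral", "neutral", "neutral", "neutral", "angry", "sad"]
y_pred = ["neutral", "neutral", "neutral", "neutral", "neutral", "sad"]

wa = weighted_accuracy(y_true, y_pred)    # 5 of 6 utterances correct
ua = unweighted_accuracy(y_true, y_pred)  # recalls 1, 0, 1 averaged over 3 classes
print(f"WA = {wa:.3f}, UA = {ua:.3f}")
```

On imbalanced emotion data the two numbers can diverge noticeably, which is one reason both are usually reported together: WA is dominated by the majority class, while UA penalizes a model that ignores rare emotions.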
