Cross-speaker emotion transfer in speech synthesis relies on extracting speaker-independent emotion embeddings for accurate emotion modeling without retaining speaker traits. However, existing timbre compression methods fail to fully separate speaker and emotion characteristics, causing speaker leakage and degraded synthesis quality. To address this, we propose DiEmo-TTS, a self-supervised distillation method that minimizes emotional information loss while preserving speaker identity. We introduce cluster-driven sampling and information perturbation to preserve emotion while removing irrelevant factors. To facilitate this process, we propose an emotion clustering and matching approach using emotional attribute prediction and speaker embeddings, enabling generalization to unlabeled data. Additionally, we design a dual conditioning transformer to better integrate style features. Experimental results confirm the effectiveness of our method in learning speaker-irrelevant emotion embeddings.
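The abstract names its components without implementation detail. As a rough illustration of the cluster-driven sampling idea, the sketch below clusters each speaker's utterances by predicted emotional attributes and matches clusters across speakers by centroid cosine similarity, so that same-emotion, different-speaker pairs can be drawn for distillation. All names (`cluster_and_match`, `matched_pair`), the choice of k-means, the attribute features, and the cluster count `k` are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of cluster-driven sampling for cross-speaker distillation.
# Assumed inputs: `attrs` holds per-utterance emotional-attribute predictions
# (e.g., arousal/valence/dominance) and `spk_ids` the speaker of each
# utterance. Names and the cluster count are illustrative.
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_match(attrs: np.ndarray, spk_ids: np.ndarray, k: int = 4):
    """Cluster each speaker's utterances by emotional attributes, then
    expose a sampler that matches clusters across speakers."""
    centroids, members = {}, {}
    for spk in np.unique(spk_ids):
        idx = np.where(spk_ids == spk)[0]
        km = KMeans(n_clusters=k, n_init=10).fit(attrs[idx])
        centroids[spk] = km.cluster_centers_
        members[spk] = [idx[km.labels_ == c] for c in range(k)]

    def matched_pair(spk_a, spk_b, rng=np.random.default_rng()):
        """Sample one utterance from each speaker whose clusters have the
        most similar emotional-attribute centroids (cosine similarity)."""
        a, b = centroids[spk_a], centroids[spk_b]
        sim = (a @ b.T) / (np.linalg.norm(a, axis=1, keepdims=True)
                           * np.linalg.norm(b, axis=1)[None, :] + 1e-8)
        ca, cb = np.unravel_index(np.argmax(sim), sim.shape)
        return rng.choice(members[spk_a][ca]), rng.choice(members[spk_b][cb])

    return matched_pair
```

Under these assumptions, a training loop would call `matched_pair(spk_a, spk_b)` to obtain utterance indices whose emotion clusters align, yielding the same-emotion, cross-speaker pairs that a distillation objective for speaker-irrelevant emotion embeddings would consume.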