Speech Emotion Conversion (SEC) aims to transform the emotion of a source utterance into a target emotion while preserving linguistic content and speaker identity. SEC on in-the-wild data is challenging because the training data are non-parallel and real-world acoustics are complex. Existing fixed-duration approaches either fail to shift the emotion effectively (high quality, low conversion) or degrade speech naturalness (low quality, high conversion). We propose TargetSEC, an embedding-driven latent diffusion framework that generates emotion-focused style embeddings conditioned on speaker identity and continuous emotion. Unlike methods that diffuse over spectrograms, TargetSEC operates in a compact latent space. Experiments on the MSP-Podcast dataset show that TargetSEC outperforms current non-duration baselines in conversion accuracy while maintaining high speech quality, and that it achieves performance comparable to duration-prediction systems without explicit temporal modeling.