We propose a highly controllable voice manipulation system that performs any-to-any voice conversion (VC) and prosody modulation simultaneously. State-of-the-art VC systems can transfer sentence-level characteristics such as speaker identity, emotion, and speaking style. However, manipulating frame-level prosody, such as pitch, energy, and speaking rate, remains challenging. Our model uses a frame-level prosody feature to transfer such properties effectively. Specifically, pitch and energy trajectories are integrated in a prosody conditioning module and then fed, alongside speaker and content embeddings, to a diffusion-based decoder that generates the mel-spectrogram of the converted speech. To adjust the speaking rate, our system includes a post-processing step based on a self-supervised model, which improves controllability. The proposed model achieves comparable speech quality and improved intelligibility compared with a state-of-the-art (SOTA) approach. It supports a wide range of fundamental frequency (F0), energy, and speed modulation while maintaining the quality of the converted speech.
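To make the notion of frame-level prosody modulation concrete, the following is a minimal sketch of how pitch and energy trajectories might be modified before they reach a conditioning module. It is not the paper's implementation: the function names `shift_f0`, `scale_energy`, and `prosody_features` are hypothetical, and the conventions used here (F0 of 0 marks an unvoiced frame, energy is an amplitude-like quantity scaled in dB, features are log-F0 and log-energy pairs) are assumptions chosen for illustration.

```python
import math


def shift_f0(f0_hz, semitones):
    """Shift a frame-level F0 trajectory (Hz) by a number of semitones.

    Assumption: a value of 0 marks an unvoiced frame and is left unchanged.
    """
    ratio = 2.0 ** (semitones / 12.0)
    return [f * ratio if f > 0 else 0.0 for f in f0_hz]


def scale_energy(energy, gain_db):
    """Scale a frame-level energy trajectory by a gain given in dB.

    Assumption: energy is amplitude-like, so the linear gain is 10^(dB/20).
    """
    gain = 10.0 ** (gain_db / 20.0)
    return [e * gain for e in energy]


def prosody_features(f0_hz, energy, eps=1e-5):
    """Pair per-frame log-F0 and log-energy into one feature sequence.

    Hypothetical stand-in for the input of a prosody conditioning module;
    unvoiced frames get a log-F0 of 0.0 by convention in this sketch.
    """
    if len(f0_hz) != len(energy):
        raise ValueError("F0 and energy must have the same number of frames")
    return [
        (math.log(f) if f > 0 else 0.0, math.log(e + eps))
        for f, e in zip(f0_hz, energy)
    ]


# Example: raise pitch by one octave and energy by 6 dB on three frames.
f0 = shift_f0([100.0, 0.0, 200.0], semitones=12)
en = scale_energy([0.5, 0.1, 0.8], gain_db=6.0)
features = prosody_features(f0, en)
```

In a full system, a sequence like `features` would be passed, together with speaker and content embeddings, to the decoder. Speaking-rate changes are not covered here, since the abstract assigns them to a separate post-processing step.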