While many recent any-to-any voice conversion models succeed in transferring some of the target speech's style information to the converted speech, they still cannot faithfully reproduce the speaking style of the target speaker. In this work, we propose a novel method that extracts rich style information from target utterances and efficiently transfers it to the source speech content, without requiring text transcriptions or speaker labels. Our approach introduces an attention mechanism that uses a self-supervised learning (SSL) model to collect the target speaker's speaking styles, each corresponding to different phonetic content. These styles are represented by a set of embeddings called a stylebook. In the next step, the source speech's phonetic content attends to the stylebook to determine the final target style for each piece of source content. Finally, the content information extracted from the source speech and the content-dependent target style embeddings are fed into a diffusion-based decoder to generate the mel-spectrogram of the converted speech. Experimental results show that the proposed method, combined with a diffusion-based generative model, achieves better speaker similarity than baseline models in any-to-any voice conversion tasks, while limiting the growth in computational complexity for longer utterances.
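The two attention steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the architecture, so the fixed set of content prototypes (`prototypes`), the scaled dot-product scoring, the function names `build_stylebook` and `lookup_style`, and all dimensions are assumptions made for illustration, and random arrays stand in for SSL content features and style features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def build_stylebook(prototypes, target_content, target_style):
    # Step 1 (assumed form): each of the K content prototypes attends over
    # the target frames and pools their style features, giving one style
    # embedding per phonetic content type.
    d = prototypes.shape[1]
    weights = softmax(prototypes @ target_content.T / np.sqrt(d))  # (K, T_t)
    return weights @ target_style                                  # (K, d_s)

def lookup_style(prototypes, source_content, stylebook):
    # Step 2 (assumed form): each source frame attends over the K stylebook
    # entries, using its phonetic content as the query.
    d = prototypes.shape[1]
    weights = softmax(source_content @ prototypes.T / np.sqrt(d))  # (T_s, K)
    return weights @ stylebook                                     # (T_s, d_s)

# Hypothetical sizes: K prototypes, content dim d, style dim d_s,
# T_t target frames, T_s source frames.
rng = np.random.default_rng(0)
K, d, d_s, T_t, T_s = 8, 16, 4, 50, 30
prototypes = rng.normal(size=(K, d))
target_content = rng.normal(size=(T_t, d))   # stand-in for SSL features
target_style = rng.normal(size=(T_t, d_s))   # stand-in for style features
source_content = rng.normal(size=(T_s, d))   # stand-in for SSL features

stylebook = build_stylebook(prototypes, target_content, target_style)
frame_styles = lookup_style(prototypes, source_content, stylebook)
print(stylebook.shape, frame_styles.shape)
```

Under these assumptions, the attention cost scales with K(T_t + T_s) rather than with T_s × T_t as in direct frame-to-frame attention between source and target. This is one plausible reading of how the growth in complexity for longer utterances could be limited, not a claim about the paper's actual design.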