Although many recent any-to-any voice conversion models succeed in transferring some of the target speech's style information to the converted speech, they still cannot faithfully reproduce the speaking style of the target speaker. In this work, we propose a novel method that extracts rich style information from target utterances and efficiently transfers it to the source speech content, without requiring text transcriptions or speaker labels. Our approach introduces an attention mechanism that uses a self-supervised learning (SSL) model to collect the speaking styles of a target speaker, each corresponding to different phonetic content. These styles are represented by a set of embeddings called a stylebook. Next, the source speech's phonetic content attends to the stylebook to determine the final target style for each piece of source content. Finally, the content information extracted from the source speech and the content-dependent target style embeddings are fed into a diffusion-based decoder to generate the mel-spectrogram of the converted speech. Experimental results show that the proposed method, combined with a diffusion-based generative model, achieves better speaker similarity than baseline models in any-to-any voice conversion tasks, while suppressing the growth in computational complexity as utterances become longer.
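The two attention steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes single-head scaled dot-product attention without learned projections, a fixed-size set of learned query vectors (`stylebook_queries`, a hypothetical name), and random vectors standing in for SSL features. The sizes `DIM`, `K`, `T_TGT`, and `T_SRC` are likewise arbitrary assumptions.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention: returns one output row per query."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out


def random_matrix(rows, cols, rng):
    """Random stand-in for real features (assumption: no real SSL model here)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
DIM, K = 8, 4          # hypothetical feature size and stylebook size
T_TGT, T_SRC = 50, 30  # hypothetical target / source lengths in frames

stylebook_queries = random_matrix(K, DIM, rng)  # assumed learned queries
target_feats = random_matrix(T_TGT, DIM, rng)   # stand-in for target SSL features
source_content = random_matrix(T_SRC, DIM, rng) # stand-in for source content features

# Step 1: summarize the target utterance into K style embeddings (the stylebook).
stylebook = attend(stylebook_queries, target_feats, target_feats)

# Step 2: each source frame attends to the stylebook to get its own style vector.
frame_styles = attend(source_content, stylebook, stylebook)

# Number of attention scores computed, compared with attending every
# source frame directly to every target frame.
stylebook_cost = K * T_TGT + T_SRC * K
direct_cost = T_SRC * T_TGT
```

Under these assumptions, the number of attention scores grows linearly with utterance length (`K * T_TGT + T_SRC * K`) rather than with the product `T_SRC * T_TGT`, which is one plausible reading of how the growth in complexity for longer utterances is kept down. In the described system, `source_content` and `frame_styles` would then be passed to the diffusion-based decoder, which is not modeled here.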