This paper presents a novel task, zero-shot voice conversion based on face images (zero-shot FaceVC), which aims to convert the voice characteristics of an utterance from any source speaker to those of a previously unseen target speaker, relying solely on a single face image of the target speaker. To address this task, we propose a face-voice memory-based zero-shot FaceVC method. The method uses a memory-based face-voice alignment module in which slots act as a bridge between the two modalities, allowing voice characteristics to be captured from face images. We also introduce a mixed supervision strategy to mitigate the long-standing inconsistency between the training and inference phases of voice conversion. To obtain speaker-independent, content-related representations, we transfer knowledge from a pretrained zero-shot voice conversion model to our zero-shot FaceVC model. Because FaceVC differs from traditional voice conversion, we design systematic subjective and objective metrics to evaluate the homogeneity, diversity, and consistency of the voice characteristics controlled by face images. Extensive experiments demonstrate the superiority of the proposed method on the zero-shot FaceVC task. Samples are available on our demo website.
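To make the idea of slots bridging the two modalities concrete, the following is a minimal sketch of one common way a slot memory can be read, not the paper's actual module. It assumes each slot holds a face-side key and a voice-side value, that addressing uses a softmax over cosine similarities, and that the recalled voice embedding is the weighted sum of the value slots. All names (`recall_voice_from_face`, `face_keys`, `voice_values`, `temperature`) and all dimensions are hypothetical, and the slot contents are random placeholders rather than learned parameters.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recall_voice_from_face(face_emb, face_keys, voice_values, temperature=1.0):
    """Address the memory with a face embedding and read out a voice embedding.

    Assumption: slot i pairs face_keys[i] with voice_values[i]; the shared
    addressing weights are what link the two modalities.
    """
    # Addressing weights: softmax over face-to-key similarities.
    weights = softmax([cosine(face_emb, k) / temperature for k in face_keys])
    # Read-out: weighted sum of the voice-side value slots.
    dim = len(voice_values[0])
    voice_emb = [
        sum(w * v[d] for w, v in zip(weights, voice_values)) for d in range(dim)
    ]
    return voice_emb, weights


# Placeholder memory with hypothetical sizes (8 slots, 16-d face, 4-d voice).
random.seed(0)
NUM_SLOTS, FACE_DIM, VOICE_DIM = 8, 16, 4
face_keys = [[random.gauss(0, 1) for _ in range(FACE_DIM)] for _ in range(NUM_SLOTS)]
voice_values = [[random.gauss(0, 1) for _ in range(VOICE_DIM)] for _ in range(NUM_SLOTS)]
face_emb = [random.gauss(0, 1) for _ in range(FACE_DIM)]

voice_emb, weights = recall_voice_from_face(face_emb, face_keys, voice_values)
```

In a trained system the keys and values would be learned jointly, and the recalled embedding would condition the voice conversion decoder in place of a speaker embedding extracted from reference audio; both of those steps are outside this sketch.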