Image-to-image translation and voice conversion generate a new facial image or a new voice while preserving part of the input's semantics: the pose in an image and the linguistic content in audio, respectively. Both can aid content creation in many applications. However, because each is limited to conversion within a single modality, matching the impressions given by a generated face and a generated voice remains an open question. We propose XFaVoT, a cross-modal style transfer framework that jointly learns four tasks: image translation and voice conversion, each guided by either audio or an image. This enables the generation of a ``face that matches a given voice'' and a ``voice that matches a given face'', as well as intra-modality translation, within a single framework. Experimental results on multiple datasets show that XFaVoT achieves cross-modal style translation of image and voice, outperforming baselines in terms of quality, diversity, and face-voice correspondence.