Image-to-image translation and voice conversion generate a new facial image or a new voice while preserving part of the input's semantics: the pose in an image and the linguistic content in audio, respectively. Both can aid content creation in many applications. However, because each is limited to conversion within a single modality, matching the impression of a generated face with that of a generated voice remains an open question. We propose XFaVoT, a cross-modal style transfer framework that jointly learns four tasks: image translation and voice conversion, each guided by either audio or an image. The cross-modal tasks enable the generation of a "face that matches a given voice" and a "voice that matches a given face", while the intra-modality translation tasks are handled within the same single framework. Experimental results on multiple datasets show that XFaVoT achieves cross-modal style translation of images and voices, outperforming baselines in terms of quality, diversity, and face-voice correspondence.