This paper studies an efficient multimodal data communication scheme for video conferencing. In the considered system, a speaker gives a talk to an audience, with the talking head video and audio being transmitted. Since the speaker does not change posture frequently while the audio (speech and music) must be delivered with high fidelity, the visual data contains redundancy that can be removed by generating the video from the audio. To this end, we propose a wave-to-video (Wav2Vid) system, an efficient video transmission framework that reduces the amount of transmitted data by generating the talking head video from the audio. In particular, the full-duration audio and short-duration video data are synchronously transmitted over a wireless channel, with neural networks (NNs) extracting and encoding the audio and video semantics. The receiver then combines the decoded audio and video data and uses a generative adversarial network (GAN) based model to generate the lip movement video of the speaker. Simulation results show that the proposed Wav2Vid system can reduce the amount of transmitted data by up to 83% while maintaining the perceptual quality of the generated conferencing video.