Humans can readily imagine a scene from sound alone, drawing on their prior knowledge of audio-visual events. In this paper, we mimic this innate human ability in deep learning models to improve the quality of video inpainting. To encode this prior knowledge, we first train an audio-visual network that learns the correspondence between auditory and visual information. The audio-visual network then serves as a guide that conveys its prior knowledge of audio-visual correspondence to the video inpainting network. This knowledge is transferred through two novel losses that we propose: an audio-visual attention loss and an audio-visual pseudo-class consistency loss. These two losses further improve video inpainting performance by encouraging the inpainted result to correspond closely to its synchronized audio. Experimental results demonstrate that the proposed method can restore a wider domain of video scenes and is particularly effective when the sounding object in the scene is partially occluded.