The existing state-of-the-art method for audio-visual conditioned video prediction predicts the next visual frame from latent codes of the audio-visual frames, which are produced by a multimodal stochastic network and a frame encoder. However, directly inferring per-pixel intensities for the next visual frame is extremely challenging because the image space is high-dimensional. To address this, we decouple audio-visual conditioned video prediction into motion modeling and appearance modeling. The multimodal motion estimation predicts future optical flow based on the audio-motion correlation. The visual branch recalls from a motion memory built from the audio features to enable better long-term prediction. We further propose context-aware refinement to counter the loss of global appearance context under long-term continuous warping. The global appearance context is extracted by a context encoder and modulated by a motion-conditioned affine transformation before being fused with the features of the warped frames. Experimental results show that our method achieves competitive results on existing benchmarks.
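The two appearance-side operations named in the abstract, warping a frame with predicted optical flow and applying a motion-conditioned affine transformation to the global context, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names `backward_warp` and `motion_conditioned_affine`, the flow convention (x displacement in channel 0, y in channel 1), the linear projections `w_gamma` and `w_beta`, the use of the warped frame itself as a stand-in for warped-frame features, and fusion by channel concatenation are all assumptions made for illustration.

```python
import numpy as np

def backward_warp(frame, flow):
    """Bilinearly sample `frame` (H, W, C) at pixel positions shifted by `flow` (H, W, 2).

    Assumed convention: flow[..., 0] is the x displacement, flow[..., 1] the y displacement.
    """
    h, w = frame.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Source coordinates for every output pixel, clamped to the image border.
    x = np.clip(xs + flow[..., 0], 0, w - 1)
    y = np.clip(ys + flow[..., 1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = (x - x0)[..., None]  # horizontal interpolation weight
    wy = (y - y0)[..., None]  # vertical interpolation weight
    top = (1 - wx) * frame[y0, x0] + wx * frame[y0, x1]
    bottom = (1 - wx) * frame[y1, x0] + wx * frame[y1, x1]
    return (1 - wy) * top + wy * bottom

def motion_conditioned_affine(context, motion, w_gamma, w_beta):
    """Scale and shift `context` (H, W, C) per channel, conditioned on `motion` (M,).

    w_gamma and w_beta are hypothetical (M, C) projection matrices.
    """
    gamma = 1.0 + motion @ w_gamma  # per-channel scale, identity when motion is zero
    beta = motion @ w_beta          # per-channel shift
    return gamma * context + beta

rng = np.random.default_rng(0)
frame = rng.random((4, 5, 3))
flow = np.zeros((4, 5, 2))
flow[..., 0] = 1.0  # every output pixel samples one pixel to its right

warped = backward_warp(frame, flow)

context = rng.random((4, 5, 3))
motion = rng.random(8)
w_gamma = 0.1 * rng.normal(size=(8, 3))
w_beta = 0.1 * rng.normal(size=(8, 3))
modulated = motion_conditioned_affine(context, motion, w_gamma, w_beta)

# Fusion by channel concatenation (an assumption; the abstract does not specify the operator).
fused = np.concatenate([warped, modulated], axis=-1)
```

Repeated warping only moves existing pixels around, which is one way to see why a separately encoded global context, re-injected at each step, could help over long horizons.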