Existing work on video frame interpolation (VFI) mostly employs deep neural networks trained to minimize the L1 or L2 distance between their outputs and the ground-truth frames. Despite recent advances, existing VFI methods tend to produce perceptually inferior results, particularly in challenging scenarios involving large motions and dynamic textures. Towards developing perceptually oriented VFI methods, we propose LDMVFI, a latent diffusion model-based VFI method that approaches the problem from a generative perspective by formulating it as a conditional generation problem. As the first effort to address VFI using latent diffusion models, we rigorously benchmark our method following the common evaluation protocol adopted in the existing VFI literature. Our quantitative experiments and user study indicate that LDMVFI is able to interpolate video content with superior perceptual quality compared to the state of the art, even in the high-resolution regime. Our source code will be made available.