Existing works on video frame interpolation (VFI) mostly employ deep neural networks trained to minimize the L1 or L2 distance between their outputs and ground-truth frames. Despite recent advances, existing VFI methods tend to produce perceptually inferior results, particularly in challenging scenarios involving large motions and dynamic textures. Towards developing perceptually oriented VFI methods, we propose LDMVFI, a latent diffusion model-based approach to VFI. LDMVFI approaches the problem from a generative perspective by formulating it as conditional generation. As the first effort to address VFI using latent diffusion models, we rigorously benchmark our method following the common evaluation protocol adopted in the existing VFI literature. Our quantitative experiments and user study indicate that LDMVFI interpolates video content with superior perceptual quality compared to the state of the art, even in the high-resolution regime. Our source code will be made publicly available.