Real-world low-resolution (LR) videos exhibit diverse and complex degradations, posing great challenges for video super-resolution (VSR) algorithms to reproduce their high-resolution (HR) counterparts with high quality. Recently, diffusion models have shown compelling performance in generating realistic details for image restoration tasks. However, the diffusion process is inherently stochastic, making it difficult to control the content of restored images. This issue becomes more serious when applying diffusion models to VSR tasks, because temporal consistency is crucial to the perceptual quality of videos. In this paper, we propose an effective real-world VSR algorithm that leverages the strength of pre-trained latent diffusion models. To ensure content consistency among adjacent frames, we exploit the temporal dynamics in LR videos to guide the diffusion process, optimizing the latent sampling path with a motion-guided loss so that the generated HR video maintains a coherent and continuous visual flow. To further mitigate the discontinuity of generated details, we insert a temporal module into the decoder and fine-tune it with an innovative sequence-oriented loss. The proposed motion-guided latent diffusion (MGLD) based VSR algorithm achieves significantly better perceptual quality than state-of-the-art methods on real-world VSR benchmark datasets, validating the effectiveness of the proposed model design and training strategies.
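The idea of guiding the latent sampling path with a motion-based loss can be illustrated with a toy sketch. This is not the paper's actual MGLD implementation: here the latent is a plain 1-D list, motion is a hypothetical integer shift, and each guidance step simply follows the gradient of an L2 consistency loss between the current latent and the motion-warped latent of the previous frame.

```python
# Toy sketch (an assumption for illustration, not the paper's MGLD code) of
# motion-guided latent correction inside a diffusion sampling loop.
# Latents are 1-D lists of floats; "motion" is a simple integer shift.

def warp(latent, shift):
    """Warp the previous frame's latent by the estimated motion (toy shift)."""
    n = len(latent)
    return [latent[(i - shift) % n] for i in range(n)]

def motion_guided_step(z_curr, z_prev, shift, lr=0.25):
    """One guidance step on the consistency loss
    L = sum((z_curr - warp(z_prev))^2),
    whose gradient w.r.t. z_curr is 2 * (z_curr - warp(z_prev))."""
    target = warp(z_prev, shift)
    return [z - lr * 2.0 * (z - t) for z, t in zip(z_curr, target)]

# Example: repeated guidance pulls the current latent toward the
# motion-warped previous-frame latent, enforcing temporal consistency.
z_prev = [1.0, 2.0, 3.0, 4.0]
z_curr = [0.0, 0.0, 0.0, 0.0]
for _ in range(20):  # guidance steps interleaved with denoising in practice
    z_curr = motion_guided_step(z_curr, z_prev, shift=1)
```

In the real algorithm, the gradient would come from backpropagating a flow-warping loss through the latent, and the update would be interleaved with the denoising steps rather than applied alone.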