Diffusion models have demonstrated exceptional capabilities in image generation and restoration, yet their application to video super-resolution faces significant challenges in maintaining both high fidelity and temporal consistency. We present DiffVSR, a diffusion-based framework for real-world video super-resolution that effectively addresses these challenges through three key innovations. For intra-sequence coherence, we develop a multi-scale temporal attention module and a temporal-enhanced VAE decoder that capture fine-grained motion details. To ensure inter-sequence stability, we introduce a noise rescheduling mechanism with an interweaved latent transition approach, which enhances temporal consistency without additional training overhead. Finally, we propose a progressive learning strategy that transitions from simple to complex degradations, enabling robust optimization despite limited high-quality video data. Extensive experiments demonstrate that DiffVSR delivers superior results in both visual quality and temporal consistency, setting a new performance standard in real-world video super-resolution.
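The abstract does not specify how the interweaved latent transition is implemented. As a purely illustrative sketch of the general idea it names, processing a long video in overlapping frame windows and blending the overlapping latents so adjacent sequences transition smoothly, here is a minimal NumPy version. All function names, the triangular blend weights, and the window/stride values are hypothetical choices, not the paper's actual algorithm:

```python
import numpy as np

def overlapping_windows(num_frames, window, stride):
    """Split frame indices into overlapping windows (assumes num_frames >= window).
    The last window is shifted back so every frame is covered."""
    starts = list(range(0, num_frames - window + 1, stride))
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)
    return [list(range(s, s + window)) for s in starts]

def blend_overlaps(latents, windows, window_outputs):
    """Blend per-window denoised latents with triangular weights so that
    frames near a window's center dominate, smoothing sequence boundaries.
    This linear-ramp blending is an assumed stand-in for the paper's
    interweaved latent transition."""
    acc = np.zeros_like(latents)
    wsum = np.zeros((latents.shape[0], 1))
    for idxs, out in zip(windows, window_outputs):
        n = len(idxs)
        # 0 at window edges, 1 at the center; floor keeps edge frames covered.
        w = np.maximum(1.0 - np.abs(np.linspace(-1, 1, n)), 1e-3)
        for j, t in enumerate(idxs):
            acc[t] += w[j] * out[j]
            wsum[t] += w[j]
    return acc / wsum  # weighted average per frame

# Toy usage: 10 frames of 3-dim latents, windows of 4 with stride 2.
lat = np.random.default_rng(0).normal(size=(10, 3))
wins = overlapping_windows(10, 4, 2)
outs = [lat[idxs] for idxs in wins]       # stand-in for per-window denoising
blended = blend_overlaps(lat, wins, outs)
```

In a real pipeline the per-window outputs come from the diffusion denoiser at each sampling step; because identical overlapping latents blend back to themselves, the operation only alters frames where adjacent windows disagree.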