Reproducing rich spatial details while maintaining temporal consistency is a challenging problem in real-world video super-resolution (Real-VSR), especially when pre-trained generative models such as Stable Diffusion (SD) are leveraged for realistic detail synthesis. Existing SD-based Real-VSR methods often compromise spatial details for temporal coherence, resulting in suboptimal visual quality. We argue that the key lies in effectively extracting degradation-robust temporal consistency priors from the low-quality (LQ) input video, and in enhancing video details while preserving the extracted consistency priors. To this end, we propose a Dual LoRA Learning (DLoRAL) paradigm that trains an effective SD-based one-step diffusion model, achieving realistic frame details and temporal consistency simultaneously. Specifically, we introduce a Cross-Frame Retrieval (CFR) module to aggregate complementary information across frames, and train a Consistency-LoRA (C-LoRA) to learn robust temporal representations from the degraded inputs. After consistency learning, we fix the CFR and C-LoRA modules and train a Detail-LoRA (D-LoRA) to enhance spatial details while aligning with the temporal space defined by C-LoRA, so that temporal coherence is preserved. The two phases alternate iteratively during optimization, collaboratively delivering consistent and detail-rich outputs. At inference, the two LoRA branches are merged into the SD model, enabling efficient, high-quality video restoration in a single diffusion step. Experiments show that DLoRAL achieves strong performance in both accuracy and speed. Code and models are available at https://github.com/yjsunnn/DLoRAL.
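The dual-LoRA scheme above rests on two mechanics: alternating optimization (only one LoRA branch is trainable per phase) and a one-time merge of both low-rank branches into the frozen base weights at inference. The toy sketch below illustrates both with NumPy on a single linear layer; all names (branch labels, rank, scale) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, rank, scale = 8, 2, 1.0

W_base = rng.standard_normal((d, d))  # frozen pre-trained (SD) weight

def make_lora(d, rank, rng):
    # Standard LoRA init: A small random, B zero, so the branch
    # initially contributes nothing to the output.
    return rng.standard_normal((rank, d)) * 0.01, np.zeros((d, rank))

A_c, B_c = make_lora(d, rank, rng)  # Consistency-LoRA (C-LoRA)
A_d, B_d = make_lora(d, rank, rng)  # Detail-LoRA (D-LoRA)

# Alternating optimization (sketch): even steps update only C-LoRA
# (consistency phase); odd steps freeze it and update only D-LoRA
# (detail phase). Real training would use task-specific losses.
for step in range(4):
    if step % 2 == 0:
        B_c += 0.01 * rng.standard_normal(B_c.shape)
    else:
        B_d += 0.01 * rng.standard_normal(B_d.shape)

# Inference-time merge: fold both low-rank updates into the base weight
# once, so a forward pass costs a single matrix multiply.
W_merged = W_base + scale * (B_c @ A_c) + scale * (B_d @ A_d)

x = rng.standard_normal(d)
y_merged = W_merged @ x
y_branched = W_base @ x + scale * (B_c @ (A_c @ x)) + scale * (B_d @ (A_d @ x))
assert np.allclose(y_merged, y_branched)  # merging preserves the output
```

The final assertion checks the property the abstract relies on: because each LoRA branch is a low-rank additive update, merging both branches into the base weights changes nothing about the computed function, only its cost.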