We propose an efficient diffusion-based text-to-video super-resolution (SR) tuning approach that leverages the capacity already learned by a pixel-level image diffusion model to capture spatial information for video generation. To this end, we design an efficient architecture by inflating the weights of the text-to-image SR model into our video generation framework. Additionally, we incorporate a temporal adapter to ensure temporal coherence across video frames. We investigate different tuning approaches based on our inflated architecture and report the trade-offs between computational cost and super-resolution quality. Empirical evaluation, both quantitative and qualitative, on the Shutterstock video dataset demonstrates that our approach performs text-to-video SR generation with good visual quality and temporal consistency. To evaluate temporal coherence, we also present visualizations in video format at https://drive.google.com/drive/folders/1YVc-KMSJqOrEUdQWVaI-Yfu8Vsfu_1aO?usp=sharing .
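The two architectural ideas named above, inflating image-model weights into a video model and adding a temporal adapter, can be illustrated with a toy Python sketch. It is not the architecture evaluated in this work: frames are flattened lists of floats, `image_sr_layer` is a hypothetical stand-in for a pretrained text-to-image SR layer, and `inflate`, `TemporalAdapter`, `build_video_block`, and `scale` are names introduced only for illustration. The residual form of the adapter and its zero initialization are assumptions of the sketch, chosen so that the inflated model initially reproduces the image model frame by frame.

```python
from typing import Callable, List

Frame = List[float]   # toy stand-in for one H x W x C frame, flattened
Video = List[Frame]   # a clip of T frames


def inflate(image_layer: Callable[[Frame], Frame]) -> Callable[[Video], Video]:
    """Reuse an image layer for video by applying it to every frame independently.

    The layer's weights are shared across time, so no new spatial parameters
    are introduced (assumption: this is how "inflation" is modeled here).
    """
    def video_layer(video: Video) -> Video:
        return [image_layer(frame) for frame in video]
    return video_layer


class TemporalAdapter:
    """Hypothetical residual adapter that mixes each pixel with its temporal neighbors."""

    def __init__(self, scale: float = 0.0) -> None:
        # Assumption: the single tunable parameter starts at 0.0,
        # which makes the adapter an identity mapping before tuning.
        self.scale = scale

    def __call__(self, video: Video) -> Video:
        num_frames = len(video)
        out: Video = []
        for t, frame in enumerate(video):
            prev_frame = video[max(t - 1, 0)]               # clamp at clip start
            next_frame = video[min(t + 1, num_frames - 1)]  # clamp at clip end
            mixed = []
            for x, p, n in zip(frame, prev_frame, next_frame):
                neighbor_mean = (p + x + n) / 3.0
                # Residual update: move the pixel toward its temporal mean.
                mixed.append(x + self.scale * (neighbor_mean - x))
            out.append(mixed)
        return out


def image_sr_layer(frame: Frame) -> Frame:
    """Hypothetical frozen image layer; a real SR layer would be a learned network."""
    return [2.0 * x + 1.0 for x in frame]


def build_video_block(image_layer: Callable[[Frame], Frame],
                      adapter: TemporalAdapter) -> Callable[[Video], Video]:
    """Compose the inflated spatial layer with the temporal adapter."""
    spatial = inflate(image_layer)
    return lambda video: adapter(spatial(video))


# Usage: a three-frame clip with one pixel per frame.
clip = [[0.0], [3.0], [0.0]]
untuned_block = build_video_block(image_sr_layer, TemporalAdapter(scale=0.0))
tuned_block = build_video_block(image_sr_layer, TemporalAdapter(scale=1.0))
print(untuned_block(clip))  # matches the per-frame image output
print(tuned_block(clip))    # frames pulled toward their temporal neighbors
```

In this sketch only `scale` would be tuned while the inflated spatial layer stays frozen, which mirrors the cost-versus-quality trade-off between tuning strategies mentioned above in a very simplified form.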