Can SAM Boost Video Super-Resolution?

The primary challenge in video super-resolution (VSR) is handling large motions in the input frames, which make it difficult to accurately aggregate information from multiple frames. Existing works either adopt deformable convolutions or estimate optical flow as a prior to establish correspondences between frames for effective alignment and fusion. However, they fail to take into account valuable semantic information that can greatly enhance this alignment and fusion, and flow-based methods rely heavily on the accuracy of a flow estimation model, which may not provide precise flows given two low-resolution frames. In this paper, we investigate a more robust and semantic-aware prior for enhanced VSR by utilizing the Segment Anything Model (SAM), a powerful foundation model that is less susceptible to image degradation. To use the SAM-based prior, we propose a simple yet effective module, the SAM-guidEd refinEment Module (SEEM), which enhances both the alignment and fusion procedures by utilizing semantic information. This lightweight plug-in module is designed both to leverage the attention mechanism for generating semantic-aware features and to integrate easily and seamlessly into existing methods. Concretely, we apply SEEM to two representative methods, EDVR and BasicVSR, and obtain consistently improved performance with minimal implementation effort on three widely used VSR datasets: Vimeo-90K, REDS, and Vid4. More importantly, we find that SEEM can improve existing methods in an efficient tuning manner, providing increased flexibility in balancing performance against the number of trainable parameters. The code will be open-sourced soon.
