As a critical cue for video super-resolution (VSR), inter-frame alignment significantly impacts overall performance. However, accurate pixel-level alignment is challenging because of the intricate, interwoven motion in videos. In response, we introduce a novel paradigm for VSR named Semantic Lens, predicated on semantic priors drawn from degraded videos. Specifically, a video is modeled as instances, events, and scenes via a Semantic Extractor. These semantics assist the Pixel Enhancer in understanding the recovered content and generating more realistic visual results. The distilled global semantics embody the scene information of each frame, while the instance-specific semantics assemble the spatial-temporal contexts related to each instance. Furthermore, we devise a Semantics-Powered Attention Cross-Embedding (SPACE) block, composed of a Global Perspective Shifter (GPS) and an Instance-Specific Semantic Embedding Encoder (ISEE), to bridge the pixel-level features with semantic knowledge. Concretely, the GPS module generates pairs of affine transformation parameters, conditioned on global semantics, for pixel-level feature modulation. The ISEE module then harnesses the attention mechanism to align adjacent frames in the instance-centric semantic space. In addition, we incorporate a simple yet effective pre-alignment module to alleviate the difficulty of model training. Extensive experiments demonstrate the superiority of our model over existing state-of-the-art VSR methods.
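To make the two SPACE components more concrete, the following is a minimal sketch in plain Python, not the implementation used in this work. It assumes that GPS-style modulation takes the common form of a per-channel scale and shift predicted linearly from a global semantic vector, and that ISEE-style alignment can be approximated by scaled dot-product attention in which pixel features act as queries over instance semantic tokens. All function names, weight matrices, and shapes (`predict_affine`, `affine_modulate`, `cross_attend`, `w_gamma`, `w_beta`) are hypothetical and chosen only for illustration.

```python
import math


def linear(weights, vec):
    """Matrix-vector product; `weights` is a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def predict_affine(global_sem, w_gamma, w_beta):
    """Map a global semantic vector to per-channel (gamma, beta).

    Assumption: gamma is predicted as an offset from 1, so all-zero
    weights leave the pixel features unchanged.
    """
    gamma = [1.0 + g for g in linear(w_gamma, global_sem)]
    beta = linear(w_beta, global_sem)
    return gamma, beta


def affine_modulate(features, gamma, beta):
    """Scale and shift each channel: out[c][p] = gamma[c] * x[c][p] + beta[c].

    `features` is a list of channels, each a flat list of pixel values.
    """
    return [[g * v + b for v in channel]
            for channel, g, b in zip(features, gamma, beta)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys, values):
    """Scaled dot-product attention.

    Assumption: queries are pixel feature vectors of the current frame,
    while keys and values are instance semantic tokens.
    """
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
    return outputs


# Toy example: 2 channels x 3 pixels, 2-dimensional global semantic vector.
features = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
gamma, beta = predict_affine([1.0, 0.0],
                             [[0.5, 0.0], [0.0, 0.0]],
                             [[0.0, 0.0], [1.0, 0.0]])
modulated = affine_modulate(features, gamma, beta)

# One pixel query attending over two instance tokens with identical keys,
# so both tokens receive equal attention weight.
aligned = cross_attend([[1.0, 0.0]],
                       [[0.0, 0.0], [0.0, 0.0]],
                       [[2.0], [4.0]])
```

A practical model would express the same operations with a tensor library and learned projections; the list-based version here only illustrates the data flow from semantics to modulated and attended pixel features.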