Video deblurring methods, which aim to recover consecutive sharp frames from a blurry video, usually assume that the input consists of consecutively blurry frames. However, in real-world videos captured by modern imaging devices, sharp frames are often interspersed within the video, providing temporally nearby sharp features that can aid the restoration of blurry frames. In this work, we propose a video deblurring method that leverages both neighboring frames and existing sharp frames, using hybrid Transformers for feature aggregation. Specifically, we first train a blur-aware detector to distinguish sharp frames from blurry ones. A window-based local Transformer is then employed to exploit features from neighboring frames, where cross attention aggregates these features without explicit spatial alignment. To aggregate the nearest sharp features from the detected sharp frames, we utilize a global Transformer with multi-scale matching capability. Moreover, our method can be easily extended to event-driven video deblurring by incorporating an event fusion module into the global Transformer. Extensive experiments on benchmark datasets demonstrate that the proposed method outperforms state-of-the-art video deblurring methods, as well as event-driven video deblurring methods, in terms of both quantitative metrics and visual quality. The source code and trained models are available at https://github.com/shangwei5/STGTN.
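The core aggregation idea — a blurry target frame queries features from a neighboring or sharp frame via cross attention, so no explicit spatial alignment (e.g., optical flow) is needed — can be illustrated with a minimal sketch. This is not the paper's architecture; the shapes, the single-head formulation, and all function names here are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feat, kv_feat, d):
    # q_feat:  (N, d) token features from the blurry target frame (queries)
    # kv_feat: (M, d) token features from a neighboring/sharp frame (keys & values)
    scores = q_feat @ kv_feat.T / np.sqrt(d)  # (N, M) pairwise similarity
    attn = softmax(scores, axis=-1)           # each target token attends over all source tokens
    return attn @ kv_feat                     # (N, d) features aggregated from the source frame

# Toy example: 4 target tokens attend to 6 source tokens of dimension 8.
rng = np.random.default_rng(0)
d = 8
blurry_tokens = rng.standard_normal((4, d))
sharp_tokens = rng.standard_normal((6, d))
out = cross_attention(blurry_tokens, sharp_tokens, d)
print(out.shape)  # (4, 8)
```

Because each target token softly matches every source token, the aggregation tolerates spatial misalignment between frames, which is the property the abstract attributes to the cross-attention design.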