In this paper, we present a robust framework for low-level vision tasks, including denoising, object removal, frame interpolation, and super-resolution, that requires no external training data. The approach learns the weights of its neural modules directly by optimizing over the corrupted test sequence, exploiting the spatio-temporal coherence and internal statistics of the video. We further introduce a novel spatial pyramid loss that exploits the recurrence of spatio-temporal patches across the different scales of a video. This loss improves robustness to unstructured noise in both the spatial and temporal domains. As a result, the framework is highly robust to degradation in the input frames and yields state-of-the-art results on downstream tasks such as denoising, object removal, and frame interpolation. We validate the approach through qualitative and quantitative evaluations on standard video datasets, including DAVIS, UCF-101, and VIMEO90K-T.
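
The abstract does not give the exact form of the spatial pyramid loss. As a rough illustration only, the sketch below assumes it sums a mean absolute error between a predicted and a reference video over progressively downsampled copies of both. The function names (`downsample2x`, `mean_abs_diff`, `spatial_pyramid_loss`), the 2x2 average pooling, the L1 error, and the `num_levels` parameter are all hypothetical choices made for this example, not the authors' implementation.

```python
def downsample2x(frame):
    """Halve the spatial resolution of one frame (a list of rows) by 2x2 average pooling."""
    h, w = len(frame) // 2, len(frame[0]) // 2
    return [
        [
            (frame[2 * i][2 * j] + frame[2 * i][2 * j + 1]
             + frame[2 * i + 1][2 * j] + frame[2 * i + 1][2 * j + 1]) / 4.0
            for j in range(w)
        ]
        for i in range(h)
    ]


def mean_abs_diff(video_a, video_b):
    """Mean absolute per-pixel difference between two videos (lists of frames)."""
    total, count = 0.0, 0
    for frame_a, frame_b in zip(video_a, video_b):
        for row_a, row_b in zip(frame_a, frame_b):
            for a, b in zip(row_a, row_b):
                total += abs(a - b)
                count += 1
    return total / count


def spatial_pyramid_loss(pred, target, num_levels=3):
    """Sum the reconstruction error over a spatial pyramid (assumed form, see lead-in)."""
    loss = 0.0
    for _ in range(num_levels):
        loss += mean_abs_diff(pred, target)
        # Stop once the frames are too small to be pooled again.
        if len(pred[0]) < 2 or len(pred[0][0]) < 2:
            break
        pred = [downsample2x(f) for f in pred]
        target = [downsample2x(f) for f in target]
    return loss


# Toy data: two 4x4 frames. The reference is all zeros; one prediction carries
# pixel-level checkerboard noise (+1/-1), the other a constant offset of 1.
clean = [[[0.0] * 4 for _ in range(4)] for _ in range(2)]
noisy = [[[1.0 if (i + j) % 2 == 0 else -1.0 for j in range(4)] for i in range(4)]
         for _ in range(2)]
shifted = [[[1.0] * 4 for _ in range(4)] for _ in range(2)]

noise_loss = spatial_pyramid_loss(noisy, clean)    # noise cancels at coarser scales
shift_loss = spatial_pyramid_loss(shifted, clean)  # a structured error persists at every scale
```

In this toy setting the checkerboard noise contributes only at the finest scale, because 2x2 averaging cancels it, while the constant offset is penalized at every level. That matches the intuition stated in the abstract that a multi-scale loss is less sensitive to unstructured noise, under the assumptions above.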