RViDeformer: Efficient Raw Video Denoising Transformer with a Larger Benchmark Dataset

In recent years, raw video denoising has attracted increasing attention because of its consistency with the imaging process and the well-studied noise models available in the raw domain. However, two problems still limit denoising performance. First, there is no large dataset with realistic motion for supervised raw video denoising, because capturing paired noisy and clean frames of real dynamic scenes is difficult. To address this, we propose recapturing existing high-resolution videos displayed on a 4K screen at high and low ISO settings to construct noisy-clean paired frames. In this way, we construct a video denoising dataset (named ReCRVD) with 120 groups of noisy-clean videos, with ISO values ranging from 1600 to 25600. Second, while non-local temporal-spatial attention is beneficial for denoising, it often incurs heavy computation costs. We propose an efficient raw video denoising transformer network (RViDeformer) that explores both short- and long-distance correlations. Specifically, we propose multi-branch spatial and temporal attention modules, which explore patch correlations within a local window, a local low-resolution window, a global downsampled window, and a neighbor-involved window, and then fuse the results. We employ reparameterization to reduce computation costs. Our network is trained in both supervised and unsupervised manners, achieving better performance than state-of-the-art methods. Additionally, the model trained with our proposed dataset (ReCRVD) outperforms the model trained with the previous benchmark dataset (CRVD) when evaluated on real-world outdoor noisy videos. Our code and dataset will be released after the acceptance of this work.
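The abstract does not say which form of reparameterization is used, so the following is only a minimal sketch of the general idea, under the assumption that the parallel branches are linear and their outputs are summed. In that case the branches can be folded into a single linear layer for inference, giving the same output at the cost of one branch. All names here (`multi_branch`, `reparameterize`, and so on) are hypothetical and not taken from the authors' code.

```python
# Generic structural reparameterization sketch (assumption: the branches are
# linear maps whose outputs are summed; this is not the authors' implementation).

def linear(W, b, x):
    """Apply y = W x + b, with W a list of rows and b, x plain lists."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]

def multi_branch(branches, x):
    """Training-time form: evaluate every branch and sum the outputs."""
    outs = [linear(W, b, x) for W, b in branches]
    return [sum(vals) for vals in zip(*outs)]

def reparameterize(branches):
    """Fold parallel linear branches into one equivalent (W, b) pair."""
    rows = len(branches[0][0])
    cols = len(branches[0][0][0])
    W = [[sum(Wk[i][j] for Wk, _ in branches) for j in range(cols)]
         for i in range(rows)]
    b = [sum(bk[i] for _, bk in branches) for i in range(rows)]
    return W, b

# Two toy branches with hand-picked integer weights.
branches = [
    ([[1, 2], [3, 4]], [1, 1]),
    ([[0, 1], [1, 0]], [0, 2]),
]
x = [1, 1]

y_multi = multi_branch(branches, x)      # cost grows with the number of branches
W_merged, b_merged = reparameterize(branches)
y_single = linear(W_merged, b_merged, x) # one matrix-vector product
```

The merge is exact only because summation of linear maps is itself linear. Branches that contain nonlinearities such as softmax attention cannot be folded this way, so how the idea applies inside RViDeformer depends on details the abstract does not give.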
