Transformer-based methods have achieved strong performance in image deraining because they can model non-local information, which is vital for high-quality image reconstruction. In this paper, we find that most existing Transformers use the similarities of all tokens from the query-key pairs for feature aggregation. However, if the tokens from the query differ from those of the key, the self-attention values estimated from these tokens are still involved in feature aggregation, which interferes with clear image restoration. To overcome this problem, we propose an effective DeRaining network, Sparse Transformer (DRSformer), that adaptively keeps the most useful self-attention values for feature aggregation so that the aggregated features better facilitate high-quality image reconstruction. Specifically, we develop a learnable top-k selection operator that adaptively retains, for each query, the most crucial attention scores from the keys. In addition, because the naive feed-forward network in Transformers does not model the multi-scale information that is important for latent clear image restoration, we develop an effective mixed-scale feed-forward network to generate better features for image deraining. To learn an enriched set of hybrid features that combines local context from CNN operators, we equip our model with a mixture-of-experts feature compensator, yielding a cooperative refinement deraining scheme. Extensive experimental results on commonly used benchmarks demonstrate that the proposed method performs favorably against state-of-the-art approaches. The source code and trained models are available at https://github.com/cschenxiang/DRSformer.
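To make the idea of keeping only the most useful attention scores concrete, the following is a minimal sketch of top-k sparse attention in plain Python. It is not the DRSformer implementation: the function name `topk_sparse_attention`, the single-head token-level layout, and the fixed integer `keep` are assumptions made for illustration, whereas the abstract describes a learnable selection operator.

```python
import math


def topk_sparse_attention(q, k, v, keep):
    """Single-head attention that keeps only the `keep` largest scores per query.

    q: list of query vectors; k, v: lists of key and value vectors.
    `keep` is a fixed integer here (an assumption for this sketch); the
    learnable selection described in the abstract is not modeled.
    """
    d = len(q[0])
    out = []
    for qi in q:
        # Scaled dot-product similarity between this query and every key.
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        # Indices of the `keep` highest-scoring keys; all other keys are dropped.
        kept = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:keep]
        # Softmax over the kept scores only (subtract the max for stability).
        m = max(scores[j] for j in kept)
        w = {j: math.exp(scores[j] - m) for j in kept}
        z = sum(w.values())
        # Weighted sum of the values that belong to the kept keys.
        out.append([sum(w[j] / z * v[j][c] for j in kept) for c in range(len(v[0]))])
    return out
```

With `keep` equal to the number of keys, the sketch reduces to ordinary dense softmax attention; with a smaller `keep`, keys that are dissimilar to the query contribute nothing to the aggregated feature.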