Recent efforts in image restoration have focused on developing "all-in-one" models that can handle different degradation types and levels within a single model. However, most mainstream Transformer-based models face a dilemma between model capability and computational burden: the computational complexity of the self-attention mechanism grows quadratically with image size, and the mechanism still falls short in capturing long-range dependencies. Most Mamba-based models, in contrast, only scan feature maps along the spatial dimension for global modeling, failing to fully exploit information in the channel dimension. To address these problems, this paper proposes to exploit the complementary advantages of Mamba and Transformer without sacrificing computational efficiency. Specifically, Mamba's selective scanning mechanism is employed for spatial modeling, capturing long-range spatial dependencies at linear complexity, while the Transformer's self-attention mechanism is applied to channel modeling, avoiding the computational burden that grows quadratically with the image's spatial dimensions. Moreover, to enrich informative prompts for effective image restoration, multi-dimensional prompt learning modules are proposed to learn prompt-flows from multi-scale encoder/decoder layers. These prompts help reveal the underlying characteristics of various degradations from both spatial and channel perspectives, thereby enhancing the all-in-one model's ability to solve diverse restoration tasks. Extensive experiments on several image restoration benchmarks, including image denoising, dehazing, and deraining, demonstrate that the proposed method achieves new state-of-the-art performance compared with many popular mainstream methods. The source code and pre-trained parameters will be released at https://github.com/12138-chr/MTAIR.
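To make the complexity claim concrete, the sketch below illustrates channel-wise (transposed) self-attention in plain numpy: attention is computed across the C channels rather than the H*W pixels, so the attention map is C x C and the cost scales linearly with the number of pixels. This is a minimal illustration of the general technique, not the paper's actual module; the learned Q/K/V projections of a real implementation are replaced by identities here for brevity.

```python
import numpy as np

def channel_self_attention(x):
    """Channel-wise (transposed) self-attention sketch.

    x: feature map flattened to shape (C, H*W).
    The attention map is (C, C), so the cost is O(C^2 * H*W) --
    linear, not quadratic, in the number of pixels H*W.
    Learned Q/K/V projections are omitted (identity) for brevity.
    """
    q, k, v = x, x, x  # identity projections (illustrative only)
    # L2-normalize each channel over the spatial axis for stable affinities
    q = q / (np.linalg.norm(q, axis=1, keepdims=True) + 1e-6)
    k = k / (np.linalg.norm(k, axis=1, keepdims=True) + 1e-6)
    attn = q @ k.T                                   # (C, C) channel-affinity map
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn = attn / attn.sum(axis=1, keepdims=True)    # softmax over channels
    return attn @ v                                  # (C, H*W) re-weighted features

# Usage: 8 channels, a 16x16 image flattened to 256 spatial positions
features = np.random.rand(8, 256)
out = channel_self_attention(features)               # same shape (8, 256)
```

Because the attention map never touches an (H*W) x (H*W) matrix, doubling the image resolution only doubles the cost of this step, which is the efficiency argument behind delegating spatial modeling to Mamba and channel modeling to attention.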