Recently, State Space Models (SSMs) with efficient hardware-aware designs, e.g., Mamba, have demonstrated significant potential in computer vision tasks due to their linear computational complexity with respect to token length and their global receptive field. However, Mamba's performance on dense prediction tasks, such as human pose estimation and semantic segmentation, has been constrained by three key challenges: insufficient inductive bias, long-range forgetting, and low-resolution output representations. To address these challenges, we introduce the Dynamic Visual State Space (DVSS) block, which uses multi-scale convolutional kernels to extract local features at different scales and strengthen inductive bias, and employs deformable convolution to mitigate long-range forgetting while enabling adaptive spatial aggregation conditioned on the input and task-specific information. Building on the multi-resolution parallel design of HRNet, we further introduce the High-Resolution Visual State Space Model (HRVMamba), based on the DVSS block, which preserves high-resolution representations throughout the network while promoting effective multi-scale feature learning. Extensive experiments demonstrate HRVMamba's strong performance on dense prediction tasks, achieving competitive results against existing benchmark models without bells and whistles. Code is available at https://github.com/zhanghao5201/HRVMamba.
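The linear complexity claimed above comes from the sequential recurrence at the heart of SSMs: each token updates a fixed-size hidden state once, so cost grows linearly with sequence length. The following minimal sketch illustrates that recurrence with a diagonal transition matrix; it is an illustrative simplification only, omitting Mamba's input-dependent (selective) parameters, discretization step, and hardware-aware parallel scan. All names here (`ssm_scan`, the shapes chosen) are assumptions for the example, not the paper's implementation.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Sequential state-space recurrence: h_t = A * h_{t-1} + B x_t, y_t = C h_t.

    x: (T, d_in) input sequence
    A: (d_state,) diagonal state transition (a common simplification)
    B: (d_state, d_in) input projection
    C: (d_out, d_state) readout projection
    Runs in O(T) time with O(d_state) memory per step -- the source of
    the linear complexity in token length mentioned in the abstract.
    """
    T = x.shape[0]
    h = np.zeros(A.shape[0])            # fixed-size hidden state
    ys = np.empty((T, C.shape[0]))
    for t in range(T):
        h = A * h + B @ x[t]            # elementwise decay + input injection
        ys[t] = C @ h                   # per-step readout
    return ys
```

Because information must survive many such multiplicative updates to influence distant outputs, early-token contributions decay through repeated applications of `A`; this is the "long-range forgetting" the DVSS block's deformable convolution is designed to mitigate.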