Vision State Space Models (VSSMs), a novel architecture that combines the strengths of recurrent neural networks and latent variable models, have demonstrated remarkable performance in visual perception tasks by efficiently capturing long-range dependencies and modeling complex visual dynamics. However, their robustness under natural and adversarial perturbations remains a critical concern. In this work, we present a comprehensive evaluation of the robustness of VSSMs under various perturbation scenarios, including occlusions, alterations to image structure, common corruptions, and adversarial attacks, and compare their performance to well-established architectures such as transformers and convolutional neural networks. Furthermore, we investigate the resilience of VSSMs to object-background compositional changes on sophisticated benchmarks designed to test model performance in complex visual scenes. We also assess their robustness on object detection and segmentation tasks using corrupted datasets that mimic real-world scenarios. To gain a deeper understanding of the adversarial robustness of VSSMs, we conduct a frequency analysis of adversarial attacks, evaluating model performance under low-frequency and high-frequency perturbations. Our findings highlight the strengths and limitations of VSSMs in handling complex visual corruptions, offering valuable insights for future research and improvements in this promising field. Our code and models will be available at https://github.com/HashmatShadab/MambaRobustness.
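As an illustration of the band-limited evaluation mentioned above, the sketch below shows one common way to separate an adversarial perturbation into low- and high-frequency components with an FFT radial mask, so each band can be applied and evaluated separately. This is not taken from the released code; the function name `frequency_split` and the cutoff `radius` are hypothetical choices for illustration only.

```python
# A minimal sketch (assumed implementation, not the authors' code) of splitting
# an adversarial perturbation into low- and high-frequency bands via a
# centered FFT and a circular low-pass mask.
import torch

def frequency_split(perturbation: torch.Tensor, radius: int = 16):
    """Split a (C, H, W) perturbation into low- and high-frequency parts."""
    C, H, W = perturbation.shape
    # Centered 2D FFT of each channel.
    spectrum = torch.fft.fftshift(torch.fft.fft2(perturbation), dim=(-2, -1))

    # Circular low-pass mask around the spectrum center; `radius` is illustrative.
    yy, xx = torch.meshgrid(
        torch.arange(H) - H // 2, torch.arange(W) - W // 2, indexing="ij"
    )
    low_mask = ((xx ** 2 + yy ** 2) <= radius ** 2).float()

    # Keep only low frequencies, transform back; the residual is the high band.
    low = torch.fft.ifft2(torch.fft.ifftshift(spectrum * low_mask, dim=(-2, -1))).real
    high = perturbation - low
    return low, high

# Usage: each band-limited perturbation is added to the clean image and the
# model's accuracy under the low- and high-frequency attacks is compared.
delta = torch.randn(3, 224, 224) * 0.03          # stand-in adversarial perturbation
delta_low, delta_high = frequency_split(delta, radius=16)
```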