The Visual State Space Model (VSS) has demonstrated remarkable performance on various computer vision tasks. However, backdoor attacks have posed severe security challenges during its development. Such attacks cause an infected model to predict the target label whenever a specific trigger is activated, while the model behaves normally on benign samples. In this paper, we conduct systematic experiments to understand the robustness of VSS through the lens of backdoor attacks, specifically how the state space model (SSM) mechanism affects robustness. We first investigate the vulnerability of VSS to different backdoor triggers and reveal that the SSM mechanism, which captures contextual information within patches, makes the VSS model more susceptible to backdoor triggers than models without SSM. Furthermore, we analyze the sensitivity of the VSS model to patch processing techniques and find that these triggers are effectively disrupted by such processing. Based on these observations, we design an effective backdoor for the VSS model that recurs in each patch to resist patch perturbations. Extensive experiments across three datasets and various backdoor attacks reveal that the VSS model performs comparably to Vision Transformers (ViTs) but is less robust than Gated CNNs, which consist solely of stacked Gated CNN blocks without SSM.
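The patch-recurring trigger described above can be sketched as follows. This is a minimal illustration, not the paper's exact construction: the function name `add_per_patch_trigger`, the blend weight `alpha`, and the 16-pixel patch size are all illustrative assumptions. The idea is that because the same trigger is blended into every patch, patch-level perturbations (e.g. shuffling or dropping patches) cannot remove it entirely.

```python
import numpy as np

def add_per_patch_trigger(image, trigger, patch_size=16, alpha=0.1):
    """Blend a small trigger into every non-overlapping patch of an image.

    image:   (H, W, C) float array in [0, 1]
    trigger: (patch_size, patch_size, C) float array in [0, 1]
    alpha:   blending strength (hypothetical parameter for this sketch)
    """
    h, w, _ = image.shape
    poisoned = image.copy()
    # Tile the trigger over every patch so it survives patch perturbations.
    for i in range(0, h - patch_size + 1, patch_size):
        for j in range(0, w - patch_size + 1, patch_size):
            poisoned[i:i + patch_size, j:j + patch_size] = (
                (1 - alpha) * poisoned[i:i + patch_size, j:j + patch_size]
                + alpha * trigger
            )
    return np.clip(poisoned, 0.0, 1.0)
```

Poisoned training samples would then be relabeled with the attacker's target class; at test time, any image carrying the tiled trigger activates the backdoor.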