State space models (SSMs) like Mamba offer efficient alternatives to Transformer-based language models, with linear time complexity. Yet, their adversarial robustness remains critically unexplored. This paper studies a phenomenon we call the Hidden State Poisoning Attack (HiSPA), in which specific short input phrases induce a partial amnesia effect in such models by irreversibly overwriting information in their hidden states. Our benchmark, RoBench25, evaluates a model's information retrieval capabilities when subjected to HiSPAs and confirms the vulnerability of SSMs to such attacks. Even a recent 52B hybrid SSM-Transformer model from the Jamba family collapses on RoBench25 under optimized HiSPA triggers, whereas pure Transformers do not. We also observe that HiSPA triggers significantly weaken the Jamba model on the popular Open-Prompt-Injections benchmark, while pure Transformers remain unaffected. Finally, our interpretability study reveals patterns in Mamba's hidden layers during HiSPAs that could be used to build a HiSPA mitigation system. The full code and data to reproduce the experiments can be found at https://anonymous.4open.science/r/hispa_anonymous-5DB0.