CLASP: Defending Hybrid Large Language Models Against Hidden State Poisoning Attacks

State space models (SSMs) such as Mamba have gained significant traction as efficient alternatives to Transformers, achieving linear complexity while maintaining competitive performance. However, Hidden State Poisoning Attacks (HiSPAs), a recently discovered vulnerability in which adversarial strings corrupt SSM memory, pose a critical threat to these architectures and their hybrid variants. Framing HiSPA mitigation as a token-level binary classification problem, we introduce the CLASP model to defend against this threat. CLASP exploits distinct patterns in Mamba's block output embeddings (BOEs) and uses an XGBoost classifier to identify malicious tokens with minimal computational overhead. We consider a realistic scenario in which both SSMs and HiSPAs are likely to be used: an LLM screening résumés to identify the best candidates for a role. Evaluated on a corpus of 2,483 résumés totaling 9.5M tokens with controlled injections, CLASP achieves a 95.9% token-level F1 score and a 99.3% document-level F1 score on malicious-token detection. Crucially, the model generalizes to unseen attack patterns: under leave-one-out cross-validation, performance remains high (96.9% document-level F1), while under clustered cross-validation with structurally novel triggers, it maintains useful detection capability (91.6% average document-level F1). Operating independently of any downstream model, CLASP processes 1,032 tokens per second while using under 4 GB of VRAM, potentially making it suitable for real-world deployment as a lightweight front-line defense for SSM-based and hybrid architectures. All code and detailed results are available at https://anonymous.4open.science/r/hispikes-91C0.
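
The abstract reports F1 at two granularities, per token and per document. The sketch below illustrates how the two scores can be computed from the same per-token predictions and why the document-level score can be higher than the token-level one. It is a minimal illustration, not the authors' evaluation code: the function names, the toy labels, and in particular the aggregation rule (a document counts as malicious if at least `min_flagged` of its tokens are flagged, with `min_flagged=1`) are assumptions made here for the example.

```python
from typing import Dict, List, Tuple


def f1_score(true_labels: List[int], pred_labels: List[int]) -> float:
    """F1 for the positive (malicious = 1) class, computed from raw counts."""
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def document_label(token_labels: List[int], min_flagged: int = 1) -> int:
    """Assumed aggregation rule: a document is malicious if at least
    `min_flagged` of its tokens are labeled 1."""
    return int(sum(token_labels) >= min_flagged)


def evaluate(docs: List[Tuple[List[int], List[int]]]) -> Dict[str, float]:
    """Each document is a pair (true token labels, predicted token labels)."""
    all_true = [t for true, _ in docs for t in true]
    all_pred = [p for _, pred in docs for p in pred]
    doc_true = [document_label(true) for true, _ in docs]
    doc_pred = [document_label(pred) for _, pred in docs]
    return {
        "token_f1": f1_score(all_true, all_pred),
        "document_f1": f1_score(doc_true, doc_pred),
    }


# Toy data (hypothetical): two injected documents and one clean document.
toy_docs = [
    ([0, 0, 1, 1, 1, 0], [0, 0, 1, 1, 0, 0]),  # one injected token missed
    ([0, 0, 0, 0], [0, 0, 0, 0]),              # clean document, nothing flagged
    ([0, 1, 1, 0], [0, 1, 1, 1]),              # one benign token wrongly flagged
]

if __name__ == "__main__":
    print(evaluate(toy_docs))
```

In this toy case the classifier makes two token-level mistakes, yet every document is still labeled correctly, so the document-level score is higher than the token-level one. This is the same direction as the gap between the two reported figures, although the aggregation rule actually used in the paper may differ from the one assumed here.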
