Large Language Models (LLMs) have become increasingly vulnerable to jailbreak attacks that circumvent their safety mechanisms. While existing defense methods either fail under adaptive attacks or require computationally expensive auxiliary models, we present STShield, a lightweight framework for real-time jailbreak detection. STShield introduces a novel single-token sentinel mechanism that appends a binary safety indicator to the model's response sequence, leveraging the LLM's own alignment capabilities for detection. Our framework combines supervised fine-tuning on normal prompts with adversarial training using embedding-space perturbations, achieving robust detection while preserving model utility. Extensive experiments demonstrate that STShield successfully defends against a wide range of jailbreak attacks while maintaining the model's performance on legitimate queries. Compared with existing approaches, STShield achieves superior defense performance with minimal computational overhead, making it a practical solution for real-world LLM deployment.
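The single-token sentinel idea can be illustrated with a minimal sketch: after the model finishes its response, one extra position is scored against two dedicated sentinel tokens, and the higher-probability token is taken as the safety verdict. The token ids, logit values, and decision rule below are illustrative assumptions for exposition, not STShield's actual implementation details.

```python
import math

# Hypothetical vocabulary ids for the two sentinel tokens (assumed, not from the paper).
SAFE_TOKEN_ID = 32000
UNSAFE_TOKEN_ID = 32001

def sentinel_verdict(sentinel_logits: dict) -> str:
    """Return a binary safety verdict from the logits the LLM assigns
    to the two sentinel tokens at the appended sentinel position."""
    safe_logit = sentinel_logits[SAFE_TOKEN_ID]
    unsafe_logit = sentinel_logits[UNSAFE_TOKEN_ID]
    # Softmax restricted to the two sentinel tokens gives P(unsafe).
    p_unsafe = 1.0 / (1.0 + math.exp(safe_logit - unsafe_logit))
    return "unsafe" if p_unsafe > 0.5 else "safe"

# Toy logits a model might emit at the sentinel position (fabricated values).
benign_logits = {SAFE_TOKEN_ID: 5.2, UNSAFE_TOKEN_ID: -1.3}
jailbreak_logits = {SAFE_TOKEN_ID: -0.4, UNSAFE_TOKEN_ID: 4.1}
print(sentinel_verdict(benign_logits))     # safe
print(sentinel_verdict(jailbreak_logits))  # unsafe
```

Because the verdict is read from a single extra token in the model's own output sequence, the check adds only one decoding step of overhead, which is consistent with the abstract's claim of minimal computational cost.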