Large Language Models (LLMs) have been widely deployed across diverse domains, including virtual assistants, automated code generation, and scientific research. However, they remain vulnerable to jailbreak attacks, which manipulate the models into generating harmful responses despite safety alignment. Recent studies have shown that current safety-aligned LLMs exhibit only shallow safety alignment. In this work, we conduct an in-depth investigation into the underlying mechanism of this phenomenon and reveal that it manifests through learned "safety trigger tokens," which activate the model's safety patterns when paired with specific inputs. Through both analysis and empirical verification, we further demonstrate that these safety trigger tokens are highly similar across different harmful inputs. Accordingly, we propose D-STT, a simple yet effective defense algorithm that identifies and explicitly decodes the safety trigger tokens of a given safety-aligned LLM to activate the model's learned safety patterns. In this process, the safety trigger is constrained to a single token, which preserves model usability by introducing minimal intervention in the decoding process. Extensive experiments across diverse jailbreak attacks and benign prompts demonstrate that D-STT significantly reduces output harmfulness while preserving model usability and incurring negligible response time overhead, outperforming ten baseline methods.
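To make the core idea concrete, below is a minimal sketch, not the paper's released implementation, of how a single pre-identified safety trigger token could be explicitly decoded at the start of generation with Hugging Face transformers. The model name and the trigger token id are illustrative assumptions, and D-STT's actual trigger-identification step is omitted.

```python
# Minimal sketch (assumed setup): force one pre-identified "safety trigger token"
# as the first decoded token, then let the model continue decoding normally.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"   # assumed safety-aligned LLM
SAFETY_TRIGGER_ID = 306                        # hypothetical trigger token id

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

def generate_with_trigger(prompt: str, max_new_tokens: int = 256) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Single-token intervention: append the safety trigger token to the prompt
    # so it is the first token of the response.
    trigger = torch.tensor([[SAFETY_TRIGGER_ID]], device=model.device)
    seeded_ids = torch.cat([inputs.input_ids, trigger], dim=-1)
    # Continue decoding normally; the trigger is intended to activate the
    # safety patterns learned during alignment.
    output_ids = model.generate(seeded_ids, max_new_tokens=max_new_tokens, do_sample=False)
    # Return the response starting from the forced trigger token.
    return tokenizer.decode(
        output_ids[0, inputs.input_ids.shape[-1]:], skip_special_tokens=True
    )
```

Because only the first response token is constrained, decoding on benign prompts proceeds essentially unchanged after that step, which is how the single-token design aims to preserve model usability.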