Large language models (LLMs) have revolutionized various applications, making robust safety alignment essential to prevent harmful outputs. Current safety alignment techniques, however, harbor inherent vulnerabilities due to their reliance on logit suppression. In this work, we identify critical logit-level vulnerabilities by introducing Semantic-sensitive Alignment and Generation (SSAG), a method designed to systematically manipulate output-layer logits without altering model parameters. Experiments on five popular LLMs show that SSAG exposes harmful responses with a 95% success rate while reducing response time by 86%. VulMine also demonstrates superior attack efficacy, achieving an average ASR of up to 77% against strong defensive mechanisms. These findings reveal crucial weaknesses in existing alignment methods, highlighting an urgent need for improved vulnerability detection and robust safety alignment strategies. Our code is available on GitHub.
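To make the notion of logit suppression concrete, the following toy sketch shows how an additive offset on output-layer logits changes the next-token distribution while the weights that produced the logits stay fixed. It is not the SSAG procedure: the three-token vocabulary, the logit values, the offset of 4.0, and the helper names `softmax` and `apply_logit_offset` are all assumptions chosen for illustration.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_logit_offset(logits, offsets):
    """Add a per-token offset to the logits; no model weights are touched."""
    return [l + o for l, o in zip(logits, offsets)]

# Hypothetical logits over a three-token vocabulary (illustrative values only).
base_logits = [2.0, 1.0, 0.5]

# Assumed effect of suppression: token 1 is pushed down by a fixed amount.
suppressed = apply_logit_offset(base_logits, [0.0, -4.0, 0.0])

# An equal and opposite offset applied at the output layer undoes the suppression.
restored = apply_logit_offset(suppressed, [0.0, 4.0, 0.0])

p_base = softmax(base_logits)
p_suppressed = softmax(suppressed)
p_restored = softmax(restored)

print("base      :", [round(p, 4) for p in p_base])
print("suppressed:", [round(p, 4) for p in p_suppressed])
print("restored  :", [round(p, 4) for p in p_restored])
```

In this toy setting the suppressed token keeps a small but nonzero probability, and an output-layer offset alone is enough to recover the original distribution, which illustrates why suppression at the logit level can be fragile.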