TrapSuffix: Proactive Defense Against Adversarial Suffixes in Jailbreaking

Suffix-based jailbreak attacks append an adversarial suffix, i.e., a short token sequence, to steer aligned LLMs into producing unsafe outputs. Because suffixes are free-form text, they admit a virtually unlimited number of surface forms, which makes jailbreak mitigation difficult. Most existing defenses rely on passively detecting suspicious suffixes and do not exploit the defender's inherent asymmetric advantage: the ability to inject secrets and proactively conceal gaps. Motivated by this, we take a controllability-oriented perspective and develop a proactive defense that forces attackers into a no-win dilemma: either they fall into defender-designed optimization traps and fail to produce an effective adversarial suffix, or they succeed only by generating adversarial suffixes that carry distinctive, traceable fingerprints. We propose TrapSuffix, a lightweight fine-tuning approach that injects trap-aligned behaviors into the base model without changing the inference pipeline. TrapSuffix channels jailbreak attempts into these two outcomes by reshaping the model's response landscape with respect to adversarial suffixes. Across diverse suffix-based jailbreak settings, TrapSuffix reduces the average attack success rate to below 0.01 percent and achieves an average tracing success rate of 87.9 percent, providing both strong defense and reliable traceability. It introduces no inference-time overhead and incurs negligible memory cost: it requires only 15.87 MB of additional memory on average, whereas state-of-the-art LLM-based detection defenses typically incur memory overheads on the order of 1e4 MB. TrapSuffix also composes naturally with existing filtering-based defenses for complementary protection.
