Large language models (LLMs) have achieved impressive performance across natural language tasks and are increasingly deployed in real-world applications. Despite extensive safety alignment efforts, recent studies show that such alignment is often shallow and remains vulnerable to jailbreak attacks. Existing defense mechanisms, including decoding-based constraints and post-hoc content detectors, struggle against sophisticated jailbreaks, often failing to detect them robustly or excessively degrading model utility. In this work, we examine the decoding process of LLMs and make a key observation: even when successfully jailbroken, models internally exhibit latent safety-related signals during generation. However, these signals are overridden by the model's drive for fluent continuation, preventing timely self-correction or refusal. Building on this observation, we propose a simple yet effective approach that explicitly surfaces and leverages these latent safety signals for early detection of unsafe content during decoding. Experiments across diverse jailbreak attacks demonstrate that our approach significantly enhances safety while maintaining low over-refusal rates on benign inputs and preserving response quality. Our results suggest that activating intrinsic safety awareness during decoding offers a promising and complementary direction for defending against jailbreak attacks. Code is available at: https://github.com/zyz13590/SafeProbing.