Large Language Models (LLMs), despite advances in safety alignment, remain vulnerable to jailbreak attacks designed to circumvent protective mechanisms. Prevailing defense strategies rely on external interventions, such as input filtering or output modification, which often lack generalizability and compromise model utility while incurring significant computational overhead. In this work, we introduce a new, training-free defense paradigm, Self-Activating Internal Defense (SAID), which reframes the defense task from external correction to internal capability activation. SAID uniquely leverages the LLM's own reasoning abilities to proactively identify and neutralize malicious intent through a three-stage pipeline: model-native intent distillation to extract core semantics, optimal safety prefix probing to activate latent safety awareness, and a conservative aggregation strategy to ensure robust decision-making. Extensive experiments on five open-source LLMs against six advanced jailbreak attacks demonstrate that SAID substantially outperforms state-of-the-art defenses in reducing harmful outputs. Crucially, it achieves this while preserving model performance on benign tasks and incurring minimal computational overhead. Our work establishes that activating the intrinsic safety mechanisms of LLMs is a more robust and scalable path toward building safer and more reliable aligned AI systems.
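To make the three-stage pipeline concrete, the following is a minimal, illustrative sketch of how SAID's stages could be orchestrated around a target LLM. The `generate` interface, prompt templates, and safety prefixes are assumptions introduced here for illustration only; the abstract does not specify these details, and the actual method may differ.

```python
# Minimal sketch of the SAID pipeline (intent distillation -> safety prefix
# probing -> conservative aggregation). All prompts, prefixes, and the
# generate() interface are hypothetical placeholders, not the paper's spec.

from typing import Callable, List


def said_defense(
    generate: Callable[[str], str],   # the target LLM's generation interface (assumed)
    user_query: str,
    safety_prefixes: List[str],
) -> bool:
    """Return True if the query is judged unsafe and should be refused."""

    # Stage 1: model-native intent distillation -- ask the model itself to
    # extract the core semantics of the (possibly obfuscated) request.
    intent = generate(
        "Summarize the core intent of the following request in one sentence:\n"
        f"{user_query}"
    )

    # Stage 2: safety prefix probing -- prepend candidate safety prefixes to
    # activate the model's latent safety awareness and collect its judgments.
    verdicts = []
    for prefix in safety_prefixes:
        answer = generate(
            f"{prefix}\nRequest intent: {intent}\n"
            "Would fulfilling this request be harmful? Answer YES or NO."
        )
        verdicts.append("YES" in answer.upper())

    # Stage 3: conservative aggregation -- flag the query as unsafe if any
    # probe judges it harmful, erring on the side of caution.
    return any(verdicts)


if __name__ == "__main__":
    # Stub model for demonstration; a real deployment would call the LLM.
    stub_llm = lambda prompt: "NO"
    prefixes = [
        "You are a careful safety reviewer.",
        "Prioritize user safety above all else.",
    ]
    print(said_defense(stub_llm, "How do I bake bread?", prefixes))
```

The sketch only illustrates the control flow implied by the abstract: the same model is reused at every stage (no external filter is trained), and the final decision takes the most cautious of the probed judgments.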