SAID: Empowering Large Language Models with Self-Activating Internal Defense

Large Language Models (LLMs), despite advances in safety alignment, remain vulnerable to jailbreak attacks designed to circumvent protective mechanisms. Prevailing defense strategies rely on external interventions, such as input filtering or output modification, which often lack generalizability and compromise model utility while incurring significant computational overhead. In this work, we introduce a new, training-free defense paradigm, Self-Activating Internal Defense (SAID), which reframes the defense task from external correction to internal capability activation. SAID uniquely leverages the LLM's own reasoning abilities to proactively identify and neutralize malicious intent through a three-stage pipeline: model-native intent distillation to extract core semantics, optimal safety prefix probing to activate latent safety awareness, and a conservative aggregation strategy to ensure robust decision-making. Extensive experiments on five open-source LLMs against six advanced jailbreak attacks demonstrate that SAID substantially outperforms state-of-the-art defenses in reducing harmful outputs. Crucially, it achieves this while preserving model performance on benign tasks and incurring minimal computational overhead. Our work establishes that activating the intrinsic safety mechanisms of LLMs is a more robust and scalable path toward building safer and more reliable aligned AI systems.

翻译：尽管大语言模型（LLMs）在安全对齐方面取得了进展，其仍易受到旨在规避保护机制的越狱攻击。现有防御策略多依赖外部干预，如输入过滤或输出修改，这类方法通常泛化能力不足，在产生显著计算开销的同时还会损害模型效用。本研究提出一种无需训练的新型防御范式——自激活内部防御（SAID），其将防御任务从外部修正重构为内部能力激活。SAID独特地利用大语言模型自身的推理能力，通过三阶段流程主动识别并化解恶意意图：模型原生意图蒸馏以提取核心语义、最优安全前缀探测以激活潜在安全认知，以及保守聚合策略以确保稳健决策。在五个开源大语言模型上针对六种先进越狱攻击的广泛实验表明，SAID在减少有害输出方面显著优于现有最先进的防御方法。关键的是，它在实现这一目标的同时，保持了模型在良性任务上的性能，且仅产生极小的计算开销。我们的工作证实，激活大语言模型的内在安全机制是构建更安全、更可靠的对齐人工智能系统的一条更稳健且可扩展的路径。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【CVPR 2022】一个完全无监督的框架，从噪声和部分测量中学习图像，Robust Equivariant Imaging: a fully unsupervised framework for learning to image

专知会员服务

25+阅读 · 2022年3月3日

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

35+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日