Large language models (LLMs) remain vulnerable to jailbreak prompts that elicit harmful or policy-violating outputs, while many existing defenses rely on expensive fine-tuning, intrusive prompt rewriting, or external guardrails that add latency and can degrade helpfulness. We present AISA, a lightweight, single-pass defense that activates safety behaviors already latent inside the model rather than treating safety as an add-on. AISA first localizes intrinsic safety awareness via spatiotemporal analysis and shows that intent-discriminative signals are broadly encoded, with especially strong separability appearing in the scaled dot-product outputs of specific attention heads near the final structural tokens before generation. Using a compact set of automatically selected heads, AISA extracts an interpretable prompt-risk score with minimal overhead, achieving detector-level performance competitive with strong proprietary baselines on small (7B) models. AISA then performs logits-level steering: it modulates the decoding distribution in proportion to the inferred risk, ranging from normal generation for benign prompts to calibrated refusal for high-risk requests -- without changing model parameters, adding auxiliary modules, or requiring multi-pass inference. Extensive experiments spanning 13 datasets, 12 LLMs, and 14 baselines demonstrate that AISA improves robustness and transfer while preserving utility and reducing false refusals, enabling safer deployment even for weakly aligned or intentionally risky model variants.
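To make the two-stage idea concrete, the following is a minimal sketch of risk-proportional logits steering. It is not the AISA implementation. It assumes, purely for illustration, that the risk score is a logistic probe over a few per-head features and that steering adds a bias to a fixed set of refusal-token logits. The names `risk_score`, `steer_logits`, `max_boost`, and `threshold` are hypothetical and do not come from the paper.

```python
import math


def risk_score(head_features, weights, bias=0.0):
    """Hypothetical linear probe: map per-head features to a risk in (0, 1)."""
    z = bias + sum(w * x for w, x in zip(weights, head_features))
    return 1.0 / (1.0 + math.exp(-z))


def steer_logits(logits, refusal_token_ids, risk, max_boost=10.0, threshold=0.5):
    """Boost refusal-token logits in proportion to the inferred risk.

    Assumption: at or below `threshold` (which must be < 1) the logits are
    left unchanged, so benign prompts decode normally; above it, the boost
    grows linearly up to `max_boost` at risk == 1.
    """
    strength = max(0.0, (risk - threshold) / (1.0 - threshold))
    steered = list(logits)  # copy, so the caller's logits are not mutated
    for token_id in refusal_token_ids:
        steered[token_id] += max_boost * strength
    return steered


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary of three tokens; index 2 stands in for a refusal token.
logits = [2.0, 1.0, 0.0]
benign = steer_logits(logits, [2], risk=0.1)   # left unchanged
harmful = steer_logits(logits, [2], risk=0.9)  # refusal token boosted
print(softmax(benign))
print(softmax(harmful))
```

In this sketch the model parameters are never touched and only one forward pass is needed: the risk score is computed from the prompt once, and the same bias is then applied at each decoding step. How AISA actually selects heads, calibrates the score, and shapes the refusal distribution is described in the paper itself, not here.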