ALERT: Zero-shot LLM Jailbreak Detection via Internal Discrepancy Amplification

Despite a rich set of safety alignment strategies, large language models (LLMs) remain highly susceptible to jailbreak attacks, which compromise safety guardrails and pose serious security risks. Existing detection methods mainly rely on jailbreak templates present in the training data. However, few studies address the more realistic and challenging zero-shot jailbreak detection setting, in which no jailbreak templates are available during training. This setting better reflects real-world scenarios, where new attacks continually emerge and evolve. To address this challenge, we propose a layer-wise, module-wise, and token-wise amplification framework that progressively magnifies internal feature discrepancies between benign and jailbreak prompts. We uncover safety-relevant layers, identify specific modules that inherently encode zero-shot discriminative signals, and localize informative safety tokens. Building on these insights, we introduce ALERT (Amplification-based Jailbreak Detector), an efficient and effective zero-shot jailbreak detector that applies two independent yet complementary classifiers to the amplified representations. Extensive experiments on three safety benchmarks demonstrate that ALERT achieves consistently strong zero-shot detection performance. Specifically, (i) across all datasets and attack strategies, ALERT reliably ranks among the top two methods, and (ii) it outperforms the second-best baseline by at least 10% in average accuracy and F1-score, and sometimes by up to 40%.
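
As a rough illustration of what "layer-wise" discrepancy scoring could look like, the sketch below ranks layers by how far apart two groups of prompts sit in feature space. The abstract does not say how ALERT measures discrepancy or which prompts it uses to find safety-relevant layers, so everything here is an assumption: the Euclidean distance between group means, the helper names `layer_discrepancy` and `top_k_layers`, and the toy features are all hypothetical and are not the authors' method.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]
# prompts[p][l] is the feature vector of prompt p at layer l (hypothetical layout)
PromptFeatures = Sequence[Sequence[Vector]]


def mean_vector(vectors: Sequence[Vector]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def layer_discrepancy(group_a: PromptFeatures, group_b: PromptFeatures) -> List[float]:
    """Per layer, the Euclidean distance between the two group means.

    Assumption: distance between means stands in for 'internal discrepancy';
    the source does not define the actual measure.
    """
    num_layers = len(group_a[0])
    scores = []
    for layer in range(num_layers):
        mu_a = mean_vector([prompt[layer] for prompt in group_a])
        mu_b = mean_vector([prompt[layer] for prompt in group_b])
        scores.append(math.dist(mu_a, mu_b))
    return scores


def top_k_layers(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k layers with the largest discrepancy, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy data: 2 prompts per group, 3 layers, 2-dimensional features.
benign = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    [[0.0, 0.0], [1.0, 2.0], [0.0, 1.0]],
]
contrast = [
    [[0.0, 0.0], [4.0, 5.0], [0.0, 2.0]],
    [[0.0, 0.0], [4.0, 5.0], [0.0, 2.0]],
]

scores = layer_discrepancy(benign, contrast)
selected = top_k_layers(scores, k=1)
print(scores, selected)
```

In this toy setup the middle layer separates the two groups most strongly, so it is the one selected. A real pipeline would read features from a model's hidden states and would presumably repeat a similar selection at the module and token level, as the abstract describes.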
