From static to adaptive: immune memory-based jailbreak detection for large language models

Large Language Models (LLMs) serve as the backbone of modern AI systems, yet they remain susceptible to adversarial jailbreak attacks. Robust detection of such malicious inputs is therefore essential for model safety. Traditional detection methods typically rely on external models trained on fixed, large-scale datasets, which often incurs significant computational overhead. Recent methods instead leverage the model's internal safety signals to enable more lightweight and efficient detection. However, these methods remain inherently static and struggle to adapt to the evolving nature of jailbreak attacks. Drawing inspiration from the biological immune mechanism, we introduce the Immune Memory Adaptive Guard (IMAG) framework. By distilling and encoding safety patterns into a persistent, evolvable memory bank, IMAG enables adaptive generalization to emerging threats. Specifically, the framework comprises three complementary components: Immune Detection, which uses retrieval to intercept known jailbreak attacks efficiently; Active Immunity, which performs proactive behavioral simulation to resolve ambiguous, unknown queries; and Memory Updating, which integrates validated attack patterns back into the memory bank. This closed-loop architecture moves LLM defense from rigid filtering to autonomous, adaptive mitigation. Extensive evaluations across five representative open-source LLMs show that our method surpasses state-of-the-art (SOTA) baselines, achieving an average detection accuracy of 94% across diverse and complex attack types.

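
The closed loop described in the abstract (retrieve against memory, simulate when retrieval is inconclusive, then write validated attacks back) can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words `embed` function, the cosine-similarity retrieval, the `threshold` value of 0.8, the `MemoryBank` class, and the caller-supplied `simulate` callback are all assumptions made for illustration, since the abstract does not specify how any of these steps is realized.

```python
import math
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in (assumption) for a real encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


@dataclass
class MemoryBank:
    """Hypothetical store of known attack patterns (names are illustrative)."""

    patterns: List[Counter] = field(default_factory=list)

    def add(self, text: str) -> None:
        self.patterns.append(embed(text))

    def max_similarity(self, text: str) -> float:
        query = embed(text)
        return max((cosine(query, p) for p in self.patterns), default=0.0)


def guard(
    query: str,
    memory: MemoryBank,
    simulate: Callable[[str], bool],
    threshold: float = 0.8,  # assumed cutoff, not taken from the paper
) -> Tuple[bool, str]:
    """Return (is_jailbreak, stage_that_decided)."""
    # Immune Detection: cheap retrieval against known attack patterns.
    if memory.max_similarity(query) >= threshold:
        return True, "immune_detection"
    # Active Immunity: fall back to a (caller-supplied) behavioral simulation.
    is_attack = simulate(query)
    # Memory Updating: store validated attacks so later lookups catch them.
    if is_attack:
        memory.add(query)
    return is_attack, "active_immunity"
```

In this sketch, a query flagged by `simulate` is decided at the slower "active_immunity" stage the first time it appears and at the "immune_detection" stage on any later appearance, which is the adaptive behavior the abstract attributes to the memory bank. In practice `simulate` would wrap whatever behavioral check the deployment uses; here it is left as a plain callback.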