Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks. To mitigate these risks, jailbreak detection methods are essential, yet existing approaches face two major challenges: generalization and accuracy. Learning-based methods trained on specific attacks fail to generalize to unseen attacks, while learning-free methods built on hand-crafted heuristics suffer from limited accuracy and efficiency. To address these limitations, we propose Learning to Detect (LoD), a learnable framework that requires no attack data and no hand-crafted heuristics. LoD first extracts layer-wise safety representations directly from the model's internal activations using Multi-modal Safety Concept Activation Vector classifiers, and then converts these high-dimensional representations into a one-dimensional anomaly score for detection via a Safety Pattern Auto-Encoder. Extensive experiments demonstrate that LoD consistently achieves state-of-the-art detection performance (AUROC) across diverse unseen jailbreak attacks on multiple LVLMs, while also significantly improving efficiency. Code is available at https://anonymous.4open.science/r/Learning-to-Detect-51CB.
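To make the two-stage pipeline concrete, below is a minimal sketch of the detection flow the abstract describes. It assumes the safety concept classifiers are linear per-layer probes over pooled hidden states and that the Safety Pattern Auto-Encoder scores inputs by reconstruction error; all dimensions, module names, and training details are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions -- the abstract does not specify them.
NUM_LAYERS, HIDDEN_DIM = 32, 4096  # LVLM layer count / hidden size

# Stage 1 (assumed form): one linear safety probe per layer, standing in
# for the Multi-modal Safety Concept Activation Vector classifiers. Each
# probe maps a layer's pooled hidden state to a scalar safety score.
probes = nn.ModuleList(nn.Linear(HIDDEN_DIM, 1) for _ in range(NUM_LAYERS))

def layerwise_safety_scores(hidden_states: torch.Tensor) -> torch.Tensor:
    """hidden_states: (NUM_LAYERS, HIDDEN_DIM) pooled activations of one input.
    Returns a (NUM_LAYERS,) vector of per-layer safety scores."""
    return torch.stack(
        [torch.sigmoid(p(h)).squeeze(-1) for p, h in zip(probes, hidden_states)]
    )

# Stage 2 (assumed form): a small auto-encoder trained only on the score
# vectors of benign inputs; its reconstruction error serves as the
# one-dimensional anomaly score used for detection.
class SafetyPatternAE(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, bottleneck), nn.ReLU())
        self.dec = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dec(self.enc(x))

ae = SafetyPatternAE(NUM_LAYERS)

def anomaly_score(hidden_states: torch.Tensor) -> float:
    s = layerwise_safety_scores(hidden_states)
    return torch.mean((ae(s) - s) ** 2).item()  # reconstruction error

# Toy usage: random activations stand in for real LVLM hidden states.
dummy = torch.randn(NUM_LAYERS, HIDDEN_DIM)
print(f"anomaly score: {anomaly_score(dummy):.4f}")  # flag if above a threshold
```

Note that in this sketch both stages are trained without any attack data: the probes would be fit on benign safety-labeled examples and the auto-encoder on benign score vectors only, so an unseen jailbreak manifests as an out-of-distribution layer-wise safety pattern with high reconstruction error.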