Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks, posing serious safety risks. Existing detection methods either learn attack-specific parameters, which hinders generalization to unseen attacks, or rely on hand-crafted heuristic principles, which limits accuracy and efficiency. To overcome these limitations, we propose Learning to Detect (LoD), a general framework that accurately detects unknown jailbreak attacks by shifting the focus from attack-specific learning to task-specific learning. The framework comprises a Multi-modal Safety Concept Activation Vector module for safety-oriented representation learning and a Safety Pattern Auto-Encoder module for unsupervised attack classification. Extensive experiments show that our method achieves consistently higher detection AUROC across diverse unknown attacks while improving efficiency. The code is available at https://anonymous.4open.science/r/Learning-to-Detect-51CB.
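As an illustrative sketch only — the abstract does not specify either module's internals — the two-stage idea (learn safety concept directions from labeled safe/unsafe calibration data, then fit an auto-encoder on benign safety patterns and flag inputs with high reconstruction error) might be prototyped as below. The toy hidden states, layer/dimension sizes, and the linear (PCA-style) auto-encoder are all our assumptions, not the paper's method:

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, d = 6, 32  # hypothetical sizes: 6 probed layers, 32-dim hidden states

# Toy stand-in for LVLM hidden states: unsafe calibration inputs are
# shifted along a fixed per-layer direction (the latent "safety concept").
shifts = rng.normal(size=(n_layers, d))
safe   = rng.normal(size=(n_layers, 200, d))
unsafe = rng.normal(size=(n_layers, 200, d)) + 3.0 * shifts[:, None, :]

# 1) Safety concept activation vector per layer: difference of class means,
#    normalized to unit length (a standard CAV-style construction).
cavs = np.stack([unsafe[l].mean(0) - safe[l].mean(0) for l in range(n_layers)])
cavs /= np.linalg.norm(cavs, axis=1, keepdims=True)

def safety_pattern(reps):
    """Project each layer's representations onto that layer's concept vector,
    yielding one pattern vector (length n_layers) per input."""
    return np.einsum('lnd,ld->nl', reps, cavs)

# 2) Auto-encoder fit on benign safety patterns only (unsupervised w.r.t.
#    attacks). A tied-weight linear auto-encoder, i.e. PCA, kept for brevity.
benign_patterns = safety_pattern(safe)
mu = benign_patterns.mean(0)
_, _, vt = np.linalg.svd(benign_patterns - mu, full_matrices=False)
W = vt[:2]  # top-2 principal directions as encoder/decoder weights

def recon_error(patterns):
    z = (patterns - mu) @ W.T   # encode into the 2-dim latent space
    x_hat = z @ W + mu          # decode back to pattern space
    return np.linalg.norm(patterns - x_hat, axis=1)

benign_err = recon_error(safety_pattern(safe))
attack_err = recon_error(safety_pattern(unsafe))  # stand-in "jailbreak" inputs
```

In this toy setup, patterns whose structure deviates from the benign training distribution reconstruct poorly, so `attack_err` is much larger on average than `benign_err`; the detection threshold (and hence the AUROC trade-off) is set on these scores.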