Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring

Large Vision-Language Models (LVLMs) are vulnerable to a growing array of multimodal jailbreak attacks, necessitating defenses that both generalize to novel threats and remain efficient enough for practical deployment. Many current strategies fall short: they either target specific attack patterns, which limits generalization, or impose high computational overhead. While lightweight anomaly-detection methods offer a promising direction, we find that their common one-class design tends to confuse novel benign inputs with malicious ones, leading to unreliable over-rejection. To address this, we propose Representational Contrastive Scoring (RCS), a framework built on a key insight: the most potent safety signals reside within the LVLM's own internal representations. Our approach inspects the internal geometry of these representations, learning a lightweight projection to maximally separate benign and malicious inputs in safety-critical layers. This enables a simple yet powerful contrastive score that differentiates true malicious intent from mere novelty. Our instantiations, MCD (Mahalanobis Contrastive Detection) and KCD (K-nearest Contrastive Detection), achieve state-of-the-art performance on a challenging evaluation protocol designed to test generalization to unseen attack types. This work demonstrates that effective jailbreak detection can be achieved by applying simple, interpretable statistical methods to the appropriate internal representations, offering a practical path towards safer LVLM deployment. Our code is available on GitHub at https://github.com/sarendis56/Jailbreak_Detection_RCS.
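
The contrast between a one-class score and a two-class contrastive score can be illustrated with a minimal sketch. The abstract does not give the exact scoring formulas, so everything below is an assumption made for illustration: the function names (`mcd_score`, `kcd_score`), the shared-covariance Mahalanobis form, the k-th-neighbor distance, and the synthetic Gaussian vectors standing in for hidden states from a safety-critical layer. The learned projection described above is omitted.

```python
import numpy as np

def fit_mahalanobis(benign, malicious, eps=1e-3):
    """Estimate class means and a shared (pooled) precision matrix."""
    mu_b = benign.mean(axis=0)
    mu_m = malicious.mean(axis=0)
    centered = np.vstack([benign - mu_b, malicious - mu_m])
    cov = centered.T @ centered / len(centered)
    cov += eps * np.eye(benign.shape[1])  # small ridge for numerical stability
    return mu_b, mu_m, np.linalg.inv(cov)

def mahalanobis_sq(x, mu, precision):
    """Squared Mahalanobis distance of each row of x to mu."""
    diff = x - mu
    return np.einsum("ij,jk,ik->i", diff, precision, diff)

def mcd_score(x, mu_b, mu_m, precision):
    """Assumed contrastive form: positive means closer to the malicious class."""
    return mahalanobis_sq(x, mu_b, precision) - mahalanobis_sq(x, mu_m, precision)

def kth_nn_distance(x, reference, k):
    """Euclidean distance from each row of x to its k-th nearest reference row."""
    dists = np.linalg.norm(x[:, None, :] - reference[None, :, :], axis=2)
    return np.sort(dists, axis=1)[:, k - 1]

def kcd_score(x, benign, malicious, k=5):
    """Assumed contrastive form: positive means closer to malicious neighbors."""
    return kth_nn_distance(x, benign, k) - kth_nn_distance(x, malicious, k)

# Synthetic stand-ins for hidden-state features (hypothetical data).
rng = np.random.default_rng(0)
dim = 8
benign_train = rng.normal(0.0, 1.0, size=(200, dim))
malicious_train = rng.normal(3.0, 1.0, size=(200, dim))
mu_b, mu_m, precision = fit_mahalanobis(benign_train, malicious_train)

benign_test = rng.normal(0.0, 1.0, size=(50, dim))
malicious_test = rng.normal(3.0, 1.0, size=(50, dim))

# "Novel benign" inputs: shifted in a direction orthogonal to the
# benign-to-malicious axis, so they are unusual but not malicious-like.
shift = np.zeros(dim)
shift[0], shift[1] = 5.0 / np.sqrt(2), -5.0 / np.sqrt(2)
novel_benign = rng.normal(0.0, 1.0, size=(50, dim)) + shift

# One-class view: distance to the benign distribution only.
one_class_benign = mahalanobis_sq(benign_test, mu_b, precision).mean()
one_class_novel = mahalanobis_sq(novel_benign, mu_b, precision).mean()

# Contrastive view: benign distance minus malicious distance.
mcd_benign = mcd_score(benign_test, mu_b, mu_m, precision).mean()
mcd_malicious = mcd_score(malicious_test, mu_b, mu_m, precision).mean()
mcd_novel = mcd_score(novel_benign, mu_b, mu_m, precision).mean()
kcd_malicious = kcd_score(malicious_test, benign_train, malicious_train).mean()
kcd_novel = kcd_score(novel_benign, benign_train, malicious_train).mean()

print("one-class distance, benign vs novel benign:", one_class_benign, one_class_novel)
print("MCD mean, benign / malicious / novel benign:", mcd_benign, mcd_malicious, mcd_novel)
print("KCD mean, malicious / novel benign:", kcd_malicious, kcd_novel)
```

In this toy setting the one-class distance grows for the shifted benign inputs, which is the kind of signal that would cause over-rejection, while both contrastive scores keep them on the benign side because the shift moves them no closer to the malicious reference set.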