Vision-language models (VLMs) are essential for contextual understanding of both visual and textual information. However, their vulnerability to adversarially manipulated inputs poses significant risks, leading to compromised outputs and raising concerns about their reliability in VLM-integrated applications. Detecting these malicious prompts is thus crucial for maintaining trust in VLM generations. A major challenge in developing a safeguarding prompt classifier is the scarcity of labeled benign and malicious data. To address this issue, we introduce VLMGuard, a novel learning framework that leverages unlabeled user prompts in the wild for malicious prompt detection. These unlabeled prompts, which arise naturally when VLMs are deployed in the open world, comprise both benign and malicious information. To harness this unlabeled data, we present an automated maliciousness estimation score for distinguishing between benign and malicious samples within the unlabeled mixture, thereby enabling the training of a binary prompt classifier on top. Notably, our framework requires no extra human annotations, offering strong flexibility and practicality for real-world applications. Extensive experiments show that VLMGuard achieves superior detection results, significantly outperforming state-of-the-art methods. Disclaimer: This paper may contain offensive examples; reader discretion is advised.
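To make the two-stage pipeline concrete, the following is a minimal sketch in Python, not the paper's implementation. It assumes user prompts have already been embedded into a VLM representation space (faked here with synthetic vectors); the scoring function (projection magnitude onto the top singular direction of the centered embeddings), the 0.9-quantile threshold, and the logistic-regression classifier are all illustrative assumptions standing in for the paper's actual maliciousness estimator and prompt classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def maliciousness_scores(embeddings: np.ndarray) -> np.ndarray:
    """Hypothetical maliciousness score: magnitude of the projection onto
    the top singular direction of the centered unlabeled embeddings.
    The paper's actual scoring function may differ; this is a sketch."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # The top right-singular vector captures the dominant direction of
    # variation in the unlabeled mixture.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return np.abs(centered @ vt[0])

def train_prompt_classifier(embeddings: np.ndarray,
                            threshold_quantile: float = 0.9):
    """Pseudo-label the unlabeled mixture by thresholding the score,
    then train a binary prompt classifier on top, as the framework
    describes. The quantile threshold is an illustrative choice."""
    scores = maliciousness_scores(embeddings)
    threshold = np.quantile(scores, threshold_quantile)
    pseudo_labels = (scores >= threshold).astype(int)  # 1 = likely malicious
    clf = LogisticRegression(max_iter=1000)
    clf.fit(embeddings, pseudo_labels)
    return clf

# Usage with synthetic embeddings standing in for VLM prompt representations:
rng = np.random.default_rng(0)
unlabeled = np.vstack([
    rng.normal(0.0, 1.0, (900, 64)),  # mostly benign prompts
    rng.normal(3.0, 1.0, (100, 64)),  # a malicious minority
])
classifier = train_prompt_classifier(unlabeled)
```

The key property this sketch illustrates is that no human labels enter the pipeline: the pseudo-labels come entirely from the automated score over the unlabeled mixture.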