The rapid evolution of generative models has led to a continuous emergence of multimodal safety risks, exposing the limitations of existing defense methods. To address these challenges, we propose ProGuard, a vision-language proactive guard that identifies and describes out-of-distribution (OOD) safety risks without the model adjustments required by traditional reactive approaches. We first construct a modality-balanced dataset of 87K samples, each annotated with both a binary safety label and a risk category under a hierarchical multimodal safety taxonomy, effectively mitigating modality bias and ensuring consistent moderation across text, image, and text-image inputs. Based on this dataset, we train our vision-language base model purely through reinforcement learning (RL) to achieve efficient and concise reasoning. To approximate proactive safety scenarios in a controlled setting, we further introduce an OOD safety category inference task and augment the RL objective with a synonym-bank-based similarity reward that encourages the model to generate concise descriptions for unseen unsafe categories. Experimental results show that ProGuard achieves performance comparable to closed-source large models on binary safety classification and substantially outperforms existing open-source guard models on unsafe content categorization. Most notably, ProGuard delivers strong proactive moderation ability, improving OOD risk detection by 52.6% and OOD risk description by 64.8%.
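To make the synonym-bank-based similarity reward concrete, the sketch below illustrates one plausible formulation: the model's generated category description is scored by its maximum embedding similarity to a bank of synonymous phrasings for the ground-truth unsafe category. This is a minimal illustration under assumed choices (the `sentence-transformers` encoder, the `SYNONYM_BANK` contents, and the `similarity_reward` helper are all hypothetical), not the paper's implementation.

```python
# Minimal sketch of a synonym-bank-based similarity reward (illustrative only).
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed sentence encoder

# Hypothetical synonym bank: unsafe category -> phrasings treated as equivalent.
SYNONYM_BANK = {
    "self_harm": ["self-injury", "suicidal content", "content promoting self-harm"],
    "weapon_making": ["weapon fabrication", "instructions for building weapons"],
}

def similarity_reward(generated_description: str, true_category: str) -> float:
    """Reward in [0, 1]: max cosine similarity between the generated description
    and any synonym of the ground-truth unsafe category."""
    synonyms = SYNONYM_BANK[true_category]
    desc_emb = encoder.encode(generated_description, convert_to_tensor=True)
    syn_embs = encoder.encode(synonyms, convert_to_tensor=True)
    sims = util.cos_sim(desc_emb, syn_embs)  # shape: (1, len(synonyms))
    return float(sims.max().clamp(min=0.0))

# Usage: score a concise description of an unseen unsafe category against the bank.
print(similarity_reward("The image gives step-by-step weapon assembly tips.",
                        "weapon_making"))
```

A dense reward of this form lets RL credit partially correct descriptions of unseen categories, rather than relying on exact label matches.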