Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning tasks through enhanced chain-of-thought capabilities. However, this advancement also introduces novel safety risks, as these models become increasingly vulnerable to harmful multimodal prompts that can trigger unethical or unsafe behaviors. Existing safety alignment approaches, primarily designed for unimodal language models, fall short of addressing the complex and nuanced threats posed by multimodal inputs. Moreover, current safety datasets lack the fine-grained, policy-grounded reasoning required to robustly align reasoning-capable VLMs. In this work, we introduce MSR-Align, a high-quality Multimodal Safety Reasoning dataset tailored to bridge this gap. MSR-Align supports fine-grained, deliberative reasoning over standardized safety policies across both vision and text modalities. Our data generation pipeline emphasizes multimodal diversity, policy-grounded reasoning, and rigorous quality filtering using strong multimodal judges. Extensive experiments demonstrate that fine-tuning VLMs on MSR-Align substantially improves robustness against both textual and vision-language jailbreak attacks, while preserving or enhancing general reasoning performance. MSR-Align provides a scalable and effective foundation for advancing the safety alignment of reasoning-capable VLMs. Our dataset is publicly available at https://huggingface.co/datasets/Leigest/MSR-Align.