Large Vision-Language Models (VLMs) have achieved remarkable performance across a wide range of tasks. However, their deployment in safety-critical domains poses significant challenges. Existing safety fine-tuning methods, which focus on textual or multimodal content, either fall short on challenging cases or disrupt the balance between helpfulness and harmlessness. Our evaluation highlights a safety reasoning gap: these methods lack the ability to reason about safety from visual input, which causes these bottlenecks. To address this limitation and enhance both visual perception and reasoning in safety-critical contexts, we propose a novel dataset that pairs multi-image inputs with safety Chain-of-Thought (CoT) labels, which serve as fine-grained reasoning supervision to improve model performance. Specifically, we introduce the Multi-Image Safety (MIS) dataset, an instruction-following dataset tailored for multi-image safety scenarios, consisting of training and test splits. Our experiments demonstrate that fine-tuning InternVL2.5-8B with MIS significantly outperforms both powerful open-source models and API-based models on challenging multi-image tasks that require safety-related visual reasoning. This approach not only delivers exceptional safety performance but also preserves general capabilities without any trade-offs. In particular, fine-tuning with MIS increases average accuracy by 0.83% across five general benchmarks and reduces the Attack Success Rate (ASR) on multiple safety benchmarks by a large margin.
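To make the two central artifacts concrete, the sketch below shows what a single MIS-style instruction-following record and the ASR metric could look like. The field names (`images`, `instruction`, `safety_cot`, `response`) and the example content are illustrative assumptions, not the paper's actual schema; only the ASR definition (fraction of adversarial prompts yielding a harmful, non-refusing response) follows standard usage.

```python
# Hypothetical sketch of one MIS-style training record plus the standard
# Attack Success Rate (ASR) metric. Schema and field names are assumptions,
# not the authors' released format.

from dataclasses import dataclass
from typing import List

@dataclass
class MISRecord:
    """One multi-image safety instruction-following sample (hypothetical schema)."""
    images: List[str]   # paths to the multi-image input
    instruction: str    # user query whose risk emerges only from combining the images
    safety_cot: str     # safety CoT label: perceive each image -> reason jointly -> judge
    response: str       # final safe (or safely refusing) answer

def attack_success_rate(harmful_flags: List[bool]) -> float:
    """ASR: fraction of adversarial prompts for which the model produced a
    harmful (non-refusing) response. Lower is better."""
    return sum(harmful_flags) / len(harmful_flags) if harmful_flags else 0.0

# Toy usage: a risk visible only when both images are considered together.
sample = MISRecord(
    images=["img_knife.jpg", "img_crowd.jpg"],
    instruction="How would I use the object in image 1 at the place in image 2?",
    safety_cot=("Image 1 shows a knife; image 2 shows a crowded plaza. "
                "Combined, the request implies harm to people, so it is unsafe."),
    response="I can't help with that, as it could cause harm to others.",
)
print(attack_success_rate([False, False, True, False]))  # 0.25
```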