Multi-modal Large Language Models (MLLMs) are increasingly deployed in interactive applications. However, their safety vulnerabilities become pronounced in multi-turn multi-modal scenarios, where harmful intent can be reconstructed gradually across turns and safety constraints are progressively forgotten as the conversation unfolds. Existing Reinforcement Learning from Human Feedback (RLHF) alignment methods are largely developed for single-turn visual question-answering (VQA) tasks and often require costly manual preference annotations, limiting their effectiveness and scalability in multi-turn dialogues. To address this challenge, we present InterSafe-V, an open-source multi-modal dialogue dataset containing 11,270 dialogues and 500 specially designed refusal VQA samples. The dataset, constructed through interactions among several models, is designed to reflect real-world scenarios more faithfully and includes VQA pairs tailored to specific domains. Building on this dataset, we propose AM$^3$Safety, a framework that combines a cold-start refusal phase with Group Relative Policy Optimization (GRPO) fine-tuning driven by turn-aware dual-objective rewards computed over entire dialogues. Experiments on Qwen2.5-VL-7B-Instruct and LLaVA-NeXT-7B show a reduction of more than 10\% in Attack Success Rate (ASR), together with gains of at least 8\% on the harmlessness dimension and over 13\% on the helpfulness dimension of multi-modal multi-turn safety benchmarks, while preserving the models' general abilities.
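To make the training signal concrete, the sketch below illustrates one plausible reading of a turn-aware dual-objective reward combined with group-relative advantages: per-turn harmlessness and helpfulness scores are blended, later turns are weighted more heavily so safety is not forgotten as the dialogue progresses, and each sampled dialogue's reward is normalized against its group as in GRPO. The weighting scheme, the mixing coefficient, and all function names are assumptions for illustration, not the authors' exact formulation.

```python
# Hypothetical sketch of a turn-aware dual-objective reward and GRPO-style
# group-relative advantages. All names and the specific weighting are
# assumptions; the paper's exact reward design may differ.

from typing import List


def dialogue_reward(harmless: List[float], helpful: List[float],
                    lam: float = 0.5) -> float:
    """Blend per-turn harmlessness and helpfulness scores over a dialogue,
    giving later turns larger weights (turn-aware weighting is an assumption)."""
    num_turns = len(harmless)
    weights = [(t + 1) / num_turns for t in range(num_turns)]  # emphasize later turns
    per_turn = [lam * hs + (1.0 - lam) * hp                    # dual objective per turn
                for hs, hp in zip(harmless, helpful)]
    return sum(w * r for w, r in zip(weights, per_turn)) / sum(weights)


def grpo_advantages(rewards: List[float]) -> List[float]:
    """Group-relative advantages: normalize each sampled dialogue's reward
    by the mean and standard deviation of its sampling group, as in GRPO."""
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 if var > 0 else 1.0
    return [(r - mu) / std for r in rewards]


# Usage: a group of four sampled dialogues, each scored over three turns.
group_rewards = [
    dialogue_reward([0.9, 0.8, 0.7], [0.6, 0.7, 0.8]),
    dialogue_reward([0.5, 0.4, 0.3], [0.9, 0.9, 0.9]),
    dialogue_reward([0.8, 0.9, 0.9], [0.5, 0.6, 0.7]),
    dialogue_reward([0.7, 0.7, 0.6], [0.7, 0.7, 0.7]),
]
print(grpo_advantages(group_rewards))
```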