Multimodal large language models (MLLMs) are increasingly deployed in real-world systems, yet their safety under adversarial prompting remains underexplored. We present a two-phase evaluation of MLLM harmlessness using a fixed benchmark of 726 adversarial prompts authored by 26 professional red teamers. Phase 1 assessed GPT-4o, Claude Sonnet 3.5, Pixtral 12B, and Qwen VL Plus; Phase 2 evaluated their successors (GPT-5, Claude Sonnet 4.5, Pixtral Large, and Qwen Omni), yielding 82,256 human harm ratings. Large, persistent differences emerged across model families: Pixtral models were consistently the most vulnerable, whereas Claude models appeared safest owing to high refusal rates. Attack success rates (ASR) showed clear alignment drift: GPT and Claude models exhibited increased ASR across generations, while Pixtral and Qwen showed modest decreases. Modality effects also shifted over time: text-only prompts were more effective in Phase 1, whereas Phase 2 produced model-specific patterns, with GPT-5 and Claude 4.5 showing near-equivalent vulnerability across modalities. These findings demonstrate that MLLM harmlessness is neither uniform nor stable across updates, underscoring the need for longitudinal, multimodal benchmarks to track evolving safety behaviour.
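As a minimal sketch of the core metric, the ASR reported above can be computed as the fraction of benchmark prompts for which human raters judged a model's response harmful. The record schema below (model name, prompt id, a boolean harm judgement) is a hypothetical illustration, not the paper's actual data format.

```python
# Hypothetical sketch: computing attack success rate (ASR) per model from
# human harm ratings. The (model, prompt_id, harmful) schema is an assumed
# simplification of the benchmark's rating data.
from collections import defaultdict

def attack_success_rate(ratings):
    """Return ASR per model: fraction of rated prompts judged harmful.

    `ratings` is an iterable of (model, prompt_id, harmful) tuples, where
    `harmful` is True if raters judged the model's response harmful.
    """
    successes = defaultdict(int)
    totals = defaultdict(int)
    for model, _prompt_id, harmful in ratings:
        totals[model] += 1
        if harmful:
            successes[model] += 1
    return {m: successes[m] / totals[m] for m in totals}

# Example: one model, two prompts, one successful attack -> ASR = 0.5
example = [("model-a", 1, True), ("model-a", 2, False)]
print(attack_success_rate(example))  # {'model-a': 0.5}
```

Comparing these per-model fractions between Phase 1 and Phase 2 on the same fixed prompt set is what makes the generation-to-generation drift directly measurable.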