The rapid evolution of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) has driven major gains in reasoning, perception, and generation across language and vision, yet whether these advances translate into comparable improvements in safety remains unclear, partly because existing evaluations are fragmented, focusing on isolated modalities or threat models. In this report, we present an integrated safety evaluation of six frontier models--GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5--assessing each across language, vision-language, and image-generation tasks under a unified protocol that combines benchmark, adversarial, multilingual, and compliance evaluations. By aggregating results into safety leaderboards and model profiles, we reveal a highly uneven safety landscape: while GPT-5.2 demonstrates consistently strong and balanced performance, other models exhibit clear trade-offs among benchmark safety, adversarial robustness, multilingual generalization, and regulatory compliance. Despite strong results on standard benchmarks, all models remain highly vulnerable under adversarial testing, with worst-case safety rates dropping below 6%. Text-to-image models show slightly stronger alignment in regulated visual risk categories, yet remain fragile when faced with adversarial or semantically ambiguous prompts. Overall, these findings highlight that safety in frontier models is inherently multidimensional--shaped by modality, language, and evaluation design--and underscore the need for standardized, holistic safety assessments that better reflect real-world risk and guide responsible deployment.