The rapid evolution of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) has produced substantial gains in reasoning, perception, and generative capability across language and vision. However, whether these advances yield commensurate improvements in safety remains unclear, in part because evaluation practices are fragmented and often limited to single modalities or threat models. In this report, we present an integrated safety evaluation of seven frontier models: GPT-5.2, Gemini 3 Pro, Qwen3-VL, Doubao 1.8, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5. We evaluate each model across language, vision-language, and image-generation settings under a unified protocol that combines benchmark, adversarial, multilingual, and compliance evaluations. Aggregating the results into safety leaderboards and per-model safety profiles across these evaluation modes reveals a sharply heterogeneous safety landscape. While GPT-5.2 demonstrates consistently strong and balanced safety performance across evaluations, other models exhibit pronounced trade-offs among benchmark safety, adversarial alignment, multilingual generalization, and regulatory compliance. Both the language and vision-language modalities show significant vulnerability under adversarial evaluation: all models degrade substantially despite strong results on standard benchmarks. Text-to-image models achieve comparatively strong alignment in regulated visual risk categories, yet remain brittle under adversarial or semantically ambiguous prompts. Overall, these results show that safety in frontier models is inherently multidimensional, shaped by modality, language, and evaluation scheme, underscoring the need for standardized safety evaluations to accurately assess real-world risk and guide responsible model development and deployment.