Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety

Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhan Zhao, Hanxun Huang, Yige Li, Yutao Wu, Jiaming Zhang, Xiang Zheng, Yang Bai, Zuxuan Wu, Xipeng Qiu, Jingfeng Zhang, Yiming Li, Xudong Han, Haonan Li, Jun Sun, Cong Wang, Jindong Gu, Baoyuan Wu, Siheng Chen, Tianwei Zhang, Yang Liu, Mingming Gong, Tongliang Liu, Shirui Pan, Cihang Xie, Tianyu Pang, Yinpeng Dong, Ruoxi Jia, Yang Zhang, Shiqing Ma, Xiangyu Zhang, Neil Gong, Chaowei Xiao, Sarah Erfani, Tim Baldwin, Bo Li, Masashi Sugiyama, Dacheng Tao, James Bailey, Yu-Gang Jiang

From arXiv: 706 papers, 60 pages, 3 figures, 14 tables; GitHub: https://github.com/xingjunm/Awesome-Large-Model-Safety

The rapid advancement of large models, driven by their exceptional abilities in learning and generalization through large-scale pre-training, has reshaped the landscape of Artificial Intelligence (AI). These models are now foundational to a wide range of applications, including conversational AI, recommendation systems, autonomous driving, content generation, medical diagnostics, and scientific discovery. However, their widespread deployment also exposes them to significant safety risks, raising concerns about robustness, reliability, and ethical implications. This survey provides a systematic review of current safety research on large models, covering Vision Foundation Models (VFMs), Large Language Models (LLMs), Vision-Language Pre-training (VLP) models, Vision-Language Models (VLMs), Diffusion Models (DMs), and large-model-powered Agents. Our contributions are summarized as follows: (1) We present a comprehensive taxonomy of safety threats to these models, including adversarial attacks, data poisoning, backdoor attacks, jailbreak and prompt injection attacks, energy-latency attacks, data and model extraction attacks, and emerging agent-specific threats. (2) We review the defense strategies proposed for each type of attack, where available, and summarize the datasets and benchmarks commonly used in safety research. (3) Building on this review, we identify and discuss the open challenges in large model safety, emphasizing the need for comprehensive safety evaluations, scalable and effective defense mechanisms, and sustainable data practices. More importantly, we highlight the necessity of collective efforts from the research community and of international collaboration. Our work can serve as a useful reference for researchers and practitioners, fostering the ongoing development of comprehensive defense systems and platforms to safeguard AI models.
