As the capabilities of generative large models advance, safety concerns about their outputs become more pronounced. To ensure the sustainable growth of the AI ecosystem, it is imperative to evaluate and mitigate the associated safety risks holistically. This survey presents a framework for safety research on large models, delineating the landscape of safety risks together with methods for safety evaluation and improvement. We begin by introducing safety issues of wide concern, then examine safety evaluation methods for large models, encompassing preference-based testing, adversarial attack approaches, issue detection, and other advanced evaluation methods. We then explore strategies for enhancing large model safety from training to deployment, highlighting cutting-edge safety approaches for each stage of building large models. Finally, we discuss the core challenges in advancing towards more responsible AI, including the interpretability of safety mechanisms, ongoing safety issues, and robustness against malicious attacks. Through this survey, we aim to provide clear technical guidance for safety researchers and to encourage further study of the safety of large models.