Large Language Models (LLMs) are advancing rapidly in architecture and capability, and as they are integrated more deeply into complex systems, the need to scrutinize their security properties becomes more urgent. This paper surveys research in the emerging interdisciplinary field of adversarial attacks on LLMs, a subfield of trustworthy ML that combines the perspectives of Natural Language Processing and Security. Prior work has shown that even LLMs that have been safety-aligned through instruction tuning and reinforcement learning from human feedback can be susceptible to adversarial attacks, which exploit weaknesses and mislead AI systems, as evidenced by the prevalence of "jailbreak" attacks on models like ChatGPT and Bard. In this survey, we first provide an overview of large language models, describe their safety alignment, and categorize existing research based on various learning structures: textual-only attacks, multi-modal attacks, and additional attack methods specifically targeting complex systems, such as federated learning or multi-agent systems. We also discuss works that focus on the fundamental sources of vulnerabilities and on potential defenses. To make this field more accessible to newcomers, we present a systematic review of existing works, a structured typology of adversarial attack concepts, and additional resources, including slides for presentations on related topics at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL'24).