It has recently been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements. In this work, we argue that the spectrum of adversarial attacks on LLMs is much broader than jailbreaking alone. We provide a broad overview of possible attack surfaces and attack goals. Based on a series of concrete examples, we discuss, categorize, and systematize attacks that coerce a variety of unintended behaviors, such as misdirection, model control, denial-of-service, or data extraction. We analyze these attacks in controlled experiments and find that many of them stem from the practice of pre-training LLMs with coding capabilities, as well as the continued existence of strange "glitch" tokens in common LLM vocabularies that should be removed for security reasons.