Prompt injection (both direct and indirect) and jailbreaking are now recognized as significant issues for large language models (LLMs), particularly due to their potential for harm in application-integrated contexts. This extended abstract explores a novel approach to protecting LLMs from such attacks, termed "soft begging." This method involves training soft prompts to counteract the effects of corrupted prompts on the LLM's output. We provide an overview of prompt injections and jailbreaking, introduce the theoretical basis of the "soft begging" technique, and discuss an evaluation of its effectiveness.
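As a rough illustration of what training a soft prompt to counteract a corrupted prompt could look like, the following toy sketch keeps a tiny "model" frozen and learns only a few prepended embedding vectors, so that the output on an input with injected tokens moves back toward the output on the clean input. Everything here is an assumption made for illustration: the mean-pooling linear model, the squared-error objective, the names (`forward`, `train`, `injected`), and all numeric values are hypothetical and are not taken from the abstract, which does not specify the architecture or the training loss.

```python
# Toy sketch of soft-prompt training; every name and number here is hypothetical.

D = 3  # embedding width of the toy model

# Frozen "model" weights: never updated during training.
W = [[1.0, 0.5, 0.0],
     [0.0, 1.0, -0.5]]


def forward(soft_prompt, tokens):
    """Prepend the soft prompt, mean-pool all embeddings, apply the frozen W."""
    seq = soft_prompt + tokens
    pooled = [sum(vec[i] for vec in seq) / len(seq) for i in range(D)]
    return [sum(row[i] * pooled[i] for i in range(D)) for row in W]


def loss_and_grad(soft_prompt, corrupted, target):
    """Squared error to the clean output, and its gradient w.r.t. one prompt vector."""
    out = forward(soft_prompt, corrupted)
    resid = [o - t for o, t in zip(out, target)]
    loss = sum(r * r for r in resid)
    n = len(soft_prompt) + len(corrupted)
    # Mean pooling makes the gradient identical for every prompt vector.
    grad = [2.0 / n * sum(resid[k] * W[k][i] for k in range(len(W)))
            for i in range(D)]
    return loss, grad


def total_loss(soft_prompt, pairs):
    """Sum of the per-pair losses for a given soft prompt."""
    return sum(loss_and_grad(soft_prompt, c, t)[0] for c, t in pairs)


def train(pairs, num_prompt_vectors=2, lr=1.0, steps=200):
    """Gradient descent on the soft prompt only; W stays fixed."""
    soft_prompt = [[0.0] * D for _ in range(num_prompt_vectors)]
    for _ in range(steps):
        grad_sum = [0.0] * D
        for corrupted, target in pairs:
            _, grad = loss_and_grad(soft_prompt, corrupted, target)
            grad_sum = [g + d for g, d in zip(grad_sum, grad)]
        for vec in soft_prompt:
            for i in range(D):
                vec[i] -= lr * grad_sum[i]
    return soft_prompt


# Hypothetical "injected" embedding appended to each clean input.
injected = [[0.0, 0.0, 4.0]]
clean_inputs = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 2.0, 0.0], [1.0, 1.0, 0.0]],
]
# Each pair: (clean input plus injected tokens, model output on the clean input).
pairs = [(clean + injected, forward([], clean)) for clean in clean_inputs]

untrained = [[0.0] * D for _ in range(2)]
initial_loss = total_loss(untrained, pairs)
trained_prompt = train(pairs)
final_loss = total_loss(trained_prompt, pairs)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

In a real LLM the soft prompt would be a block of trainable embeddings fed to a frozen transformer and optimized by backpropagation. The toy keeps only the structural point: the model parameters stay fixed, and the prepended vectors alone absorb the correction.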