We introduce a new family of prompt injection attacks, termed Neural Exec. Unlike known attacks that rely on handcrafted strings (e.g., "Ignore previous instructions and..."), we show that it is possible to conceptualize the creation of execution triggers as a differentiable search problem and use learning-based methods to autonomously generate them. Our results demonstrate that a motivated adversary can forge triggers that are not only drastically more effective than current handcrafted ones but also exhibit inherent flexibility in shape, properties, and functionality. In this direction, we show that an attacker can design and generate Neural Execs capable of persisting through multi-stage preprocessing pipelines, such as in the case of Retrieval-Augmented Generation (RAG)-based applications. More critically, our findings show that attackers can produce triggers that deviate markedly in form and shape from any known attack, sidestepping existing blacklist-based detection and sanitization approaches.