We introduce a new family of prompt injection attacks, termed Neural Exec. Unlike known attacks that rely on handcrafted strings (e.g., "Ignore previous instructions and..."), we show that the creation of execution triggers can be framed as a differentiable search problem and that learning-based methods can generate them autonomously. Our results demonstrate that a motivated adversary can forge triggers that are not only drastically more effective than current handcrafted ones but also inherently flexible in shape, properties, and functionality. In particular, we show that an attacker can design and generate Neural Execs capable of persisting through multi-stage preprocessing pipelines, such as those in Retrieval-Augmented Generation (RAG)-based applications. More critically, our findings show that attackers can produce triggers that deviate markedly in form and shape from any known attack, sidestepping existing blacklist-based detection and sanitization approaches.