Beyond Pattern Matching: Seven Cross-Domain Techniques for Prompt Injection Detection

from arxiv, v3.0 (18 May 2026): Added Sec. 5.6 with independent evaluation on three peer-reviewed benchmarks (Liu, USENIX Sec 2024; Garak, Derczynski 2024; InjecAgent, ACL Findings 2024). 8,276 unseen attacks; cross-benchmark plateau at 35-45% on subtle indirect injection. Abstract, contributions, Sec. 6, and 6 refs updated

Current open-source prompt-injection detectors converge on two architectural choices: regular-expression pattern matching and fine-tuned transformer classifiers. Both share failure modes that recent work has made concrete. Regular expressions miss paraphrased attacks. Fine-tuned classifiers are vulnerable to adaptive adversaries: a 2025 NAACL Findings study reported that eight published indirect-injection defenses were bypassed with greater than fifty percent attack success rates under adaptive attacks. This work proposes seven detection techniques that each port a specific mechanism from a discipline outside large-language-model security: forensic linguistics, materials-science fatigue analysis, deception technology from network security, local-sequence alignment from bioinformatics, mechanism design from economics, spectral signal analysis from epidemiology, and taint tracking from compiler theory. Three of the seven techniques are implemented in the prompt-shield v0.4.1 release (Apache 2.0) and evaluated in a four-configuration ablation across six datasets including deepset/prompt-injections, NotInject, LLMail-Inject, AgentHarm, and AgentDojo. The local-alignment detector lifts F1 on deepset from 0.033 to 0.378 with zero additional false positives. The stylometric detector adds 11.1 percentage points of F1 on an indirect-injection benchmark. The fatigue tracker is validated via a probing-campaign integration test. All code, data, and reproduction scripts are released under Apache 2.0.

翻译：当前开源提示注入检测器集中在两种架构选择上：正则表达式模式匹配与微调后的Transformer分类器。二者共享的失效模式已被近期研究具体揭示：正则表达式无法识别改写型攻击；微调分类器易遭受自适应对手攻击——2025年NAACL Findings研究指出，在自适应攻击下，八种已发表的间接注入防御均被以超过50%的攻击成功率绕过。本文提出七项检测技术，每项技术分别移植自大语言模型安全领域之外的特定机制：司法语言学、材料科学中的疲劳分析、网络安全中的欺骗技术、生物信息学中的局部序列比对、经济学中的机制设计、流行病学中的频谱信号分析，以及编译器理论中的污点追踪。其中三项技术已在prompt-shield v0.4.1版本（Apache 2.0许可）中实现，并在包括deepset/prompt-injections、NotInject、LLMail-Inject、AgentHarm及AgentDojo在内的六个数据集上进行了四种配置的消融评估。局部比对检测器在零额外误报条件下，将deepset数据集上的F1值从0.033提升至0.378；笔迹风格检测器在间接注入基准测试上使F1值增加11.1个百分点；疲劳追踪器通过探测式活动集成测试完成验证。所有代码、数据及复现脚本均以Apache 2.0许可公开发布。