Current open-source prompt-injection detectors converge on two architectural choices: regular-expression pattern matching and fine-tuned transformer classifiers. Both share failure modes that recent work has made concrete. Regular expressions miss paraphrased attacks. Fine-tuned classifiers are vulnerable to adaptive adversaries: a 2025 NAACL Findings study reported that eight published indirect-injection defenses were bypassed with greater than fifty percent attack success rates under adaptive attacks. This work proposes seven detection techniques that each port a specific mechanism from a discipline outside large-language-model security: forensic linguistics, materials-science fatigue analysis, deception technology from network security, local-sequence alignment from bioinformatics, mechanism design from economics, spectral signal analysis from epidemiology, and taint tracking from compiler theory. Three of the seven techniques are implemented in the prompt-shield v0.4.1 release (Apache 2.0) and evaluated in a four-configuration ablation across six datasets including deepset/prompt-injections, NotInject, LLMail-Inject, AgentHarm, and AgentDojo. The local-alignment detector lifts F1 on deepset from 0.033 to 0.378 with zero additional false positives. The stylometric detector adds 11.1 percentage points of F1 on an indirect-injection benchmark. The fatigue tracker is validated via a probing-campaign integration test. All code, data, and reproduction scripts are released under Apache 2.0.
翻译:当前开源的提示注入检测器趋同于两种架构选择:正则表达式模式匹配与微调Transformer分类器。两者均存在近期研究已具体化的失效模式:正则表达式无法应对改写型攻击,而微调分类器对自适应攻击者脆弱——2025年NAACL Findings研究显示,在自适应攻击下,八项已发表的间接注入防御方案被突破的攻击成功率超过50%。本文提出七种检测技术,每种技术均移植自大语言模型安全领域之外的特定机制:法医语言学、材料科学疲劳分析、网络安全欺骗技术、生物信息学局部序列比对、经济学机制设计、流行病学频谱信号分析、以及编译器理论污点追踪。其中三项技术已在prompt-shield v0.4.1版本(Apache 2.0许可)中实现,并在包含deepset/prompt-injections、NotInject、LLMail-Inject、AgentHarm及AgentDojo的六个数据集上进行了四配置消融实验评估。局部比对检测器在deepset数据集上将F1值从0.033提升至0.378,且未新增误报。笔迹风格检测器在间接注入基准测试中使F1值增加11.1个百分点。疲劳追踪器通过探测式对抗测试验证。所有代码、数据及复现脚本均以Apache 2.0许可开源发布。