Prompt-injection detectors are heterogeneous: each is strong on a different slice of attacks, and none is reliable on all of them. Yet existing systems still treat detection as a fixed single-detector pipeline, committing every request to one detector's blind spots. We reframe defense as detector allocation: given a heterogeneous pool, decide per request which detectors to run and whether to escalate to an LLM judge. Our framework, SCOUT (Scalable and Controllable Outcome-prediction for Uncertainty-aware Triage), makes this decision dynamic by predicting each detector's per-sample reliability and latency from how it behaved on similar past inputs, and exposes a single safety-utility threshold to the operator (where utility bundles benign-pass rate and wall-clock time). To evaluate this setting, we build SCOUT-450, a benchmark that captures the structurally complex, agent-facing injections that older prompt-injection sets under-represent. On SCOUT-450, a safety-oriented operating point reduces attack-success rate by 46% and total wall-clock time by 40% relative to an always-on GPT-4o judge, at the cost of a 5.1-point drop in benign utility. SCOUT also transfers to three external benchmarks (BIPIA, IPI, and IHEval), improving the safety-utility frontier on each.
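To make the allocation idea concrete, the following is a minimal sketch, not the SCOUT implementation. The abstract does not specify how similarity, reliability, or escalation are computed, so everything here is an assumption for illustration: inputs are represented as plain feature vectors, similarity is cosine similarity, predictions are k-nearest-neighbour averages over past outcomes, and the names `PastRecord`, `predict`, and `allocate` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class PastRecord:
    """One past input and how each detector behaved on it (hypothetical schema)."""
    embedding: list[float]      # feature vector of the past input
    correct: dict[str, bool]    # detector name -> was its verdict right?
    latency: dict[str, float]   # detector name -> observed latency in seconds


def _cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def predict(history, query, detector, k=3):
    """Estimate a detector's reliability and latency on `query`
    as averages over the k most similar past inputs (assumed estimator)."""
    neighbours = sorted(
        history, key=lambda r: _cosine(r.embedding, query), reverse=True
    )[:k]
    reliability = sum(r.correct[detector] for r in neighbours) / len(neighbours)
    latency = sum(r.latency[detector] for r in neighbours) / len(neighbours)
    return reliability, latency


def allocate(history, query, detectors, threshold, k=3):
    """Return (detectors_to_run, escalate_to_judge).

    Assumed policy: if any detector's predicted reliability meets the
    operator threshold, run only the fastest such detector. Otherwise run
    the most reliable detector and escalate to the LLM judge.
    """
    preds = {d: predict(history, query, d, k) for d in detectors}
    eligible = [d for d in detectors if preds[d][0] >= threshold]
    if eligible:
        return [min(eligible, key=lambda d: preds[d][1])], False
    return [max(detectors, key=lambda d: preds[d][0])], True
```

Under this assumed policy, raising `threshold` sends more requests to the judge (safer, slower), while lowering it lets cheap detectors handle more traffic on their own, which mirrors the single safety-utility knob described above.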