Multi-turn prompt injection attacks distribute malicious intent across multiple conversation turns, exploiting the assumption that each turn is evaluated independently. While single-turn detection has been extensively studied, no published formula exists for aggregating per-turn pattern scores into a conversation-level risk score at the proxy layer, without invoking an LLM. We identify a fundamental flaw in the intuitive weighted-average approach: it converges to the per-turn score regardless of turn count, so a 20-turn persistent attack scores identically to a single suspicious turn. Drawing on analogies from change-point detection (CUSUM), Bayesian belief updating, and risk-based alerting in security operations, we propose peak + accumulation scoring -- a formula combining peak single-turn risk, persistence ratio, and category diversity. Evaluated on 10,654 multi-turn conversations -- 588 attacks sourced from WildJailbreak adversarial prompts and 10,066 benign conversations from WildChat -- the formula achieves 90.8% recall at a 1.20% false positive rate with an F1 of 85.9%. A sensitivity analysis over the persistence parameter reveals a phase transition at rho ≈ 0.4, where recall jumps 12 percentage points with negligible FPR increase. We release the scoring algorithm, pattern library, and evaluation harness as open source.
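To make the peak + accumulation idea concrete: the abstract names the three components (peak single-turn risk, persistence ratio, category diversity) but not the exact formula, so the combination rule, the interpretation of rho as a per-turn suspicion threshold, the weights, and the diversity cap below are all hypothetical illustration, not the authors' published formula.

```python
# Illustrative sketch of peak + accumulation scoring.
# Hypothetical: the additive combination, the weights w_persist/w_diverse,
# the reading of rho as a per-turn threshold, and the category cap of 5.
# From the abstract: the three components and the rho ≈ 0.4 setting.

def conversation_risk(turn_scores, turn_categories,
                      rho=0.4, w_persist=0.3, w_diverse=0.2):
    """Aggregate per-turn pattern scores into a conversation-level risk.

    turn_scores: per-turn risk scores in [0, 1]
    turn_categories: per-turn sets of matched pattern categories
    rho: persistence threshold -- a turn counts as suspicious if its
         score exceeds rho
    """
    peak = max(turn_scores)  # peak single-turn risk
    # Persistence ratio: fraction of turns above the threshold rho.
    persistence = sum(s > rho for s in turn_scores) / len(turn_scores)
    # Category diversity: distinct pattern categories seen anywhere
    # in the conversation, capped at a hypothetical 5 categories.
    categories = set().union(*turn_categories) if turn_categories else set()
    diversity = min(len(categories) / 5.0, 1.0)
    return min(peak + w_persist * persistence + w_diverse * diversity, 1.0)
```

Unlike a weighted average, which gives a 20-turn run of 0.7-scoring turns the same score as one 0.7 turn, this sketch scores a persistent attack strictly higher than a single spike, because the persistence term grows with the fraction of suspicious turns.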