Recent intelligent systems integrate powerful Large Language Models (LLMs) through APIs, but their trustworthiness may be critically undermined by targeted attacks such as backdoor and prompt injection attacks, which covertly force LLMs to generate specific malicious sequences. Existing defenses against such threats typically require elevated access privileges, impose prohibitive costs, and hinder normal inference, rendering them impractical for real-world deployment. To address these limitations, we introduce DualSentinel, a lightweight and unified defense framework that accurately and promptly detects the activation of targeted attacks alongside the LLM generation process. We first identify a characteristic of compromised LLMs, termed Entropy Lull: when a targeted attack successfully hijacks the generation process, the LLM exhibits a distinct period of abnormally low and stable token probability entropy, indicating that it is following a fixed path rather than making creative choices. DualSentinel leverages this pattern through an innovative dual-check approach. It first employs a magnitude- and trend-aware monitoring method to proactively and sensitively flag an entropy-lull pattern at runtime. Upon such flagging, it triggers a lightweight yet powerful secondary verification based on task-flipping. An attack is confirmed only if the entropy-lull pattern persists across both the original and the flipped task, proving that the LLM's output is coercively controlled. Extensive evaluations show that DualSentinel is both highly effective (superior detection accuracy with near-zero false positives) and remarkably efficient (negligible additional cost), offering a truly practical path toward securing deployed LLMs. The source code can be accessed at https://doi.org/10.5281/zenodo.18479273.
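To make the entropy-lull idea concrete, the following minimal sketch computes per-token Shannon entropy and flags a window of generation that is both abnormally low in magnitude and flat in trend. The window size and thresholds are illustrative assumptions, not the paper's tuned values, and the helper names are hypothetical:

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def detect_entropy_lull(entropies, window=5, mean_thresh=0.2, std_thresh=0.1):
    """Flag the first position where a sliding window of per-token entropies
    is both abnormally low (magnitude check) and nearly flat (stability/trend
    check). Returns the start index of the lull window, or None if absent.
    Thresholds here are illustrative, not the paper's calibrated values."""
    for i in range(len(entropies) - window + 1):
        w = entropies[i:i + window]
        mean = sum(w) / window
        var = sum((x - mean) ** 2 for x in w) / window
        if mean < mean_thresh and math.sqrt(var) < std_thresh:
            return i
    return None

# Benign generation: entropy fluctuates as the model makes genuine choices.
benign = [2.1, 1.7, 2.4, 1.9, 2.2, 1.8, 2.5]
# Hijacked generation: after the trigger, entropy collapses and stays flat.
hijacked = benign[:3] + [0.05, 0.03, 0.06, 0.04, 0.05]
```

In a real deployment the per-step entropies would be computed from the model's next-token probability distributions during decoding; here they are hard-coded synthetic values purely to exercise the detector.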