Recent LLMs increasingly integrate reasoning mechanisms such as Chain-of-Thought (CoT). However, this explicit reasoning exposes a new attack surface for inference-time backdoors, which inject malicious reasoning paths without altering model parameters. Because these attacks generate linguistically coherent paths, they effectively evade conventional detection. To address this, we propose STAR (State-Transition Amplification Ratio), a framework that detects backdoors by analyzing shifts in output probability. STAR exploits a statistical discrepancy: a maliciously induced reasoning path exhibits high posterior probability despite having low prior probability under the model's general knowledge. We quantify this state-transition amplification and employ the CUSUM algorithm to detect persistent anomalies. Experiments across diverse models (8B-70B) and five benchmark datasets demonstrate that STAR generalizes robustly, consistently achieving near-perfect performance (AUROC $\approx$ 1.0) with approximately $42\times$ greater efficiency than existing baselines. Furthermore, the framework remains robust against adaptive attacks designed to bypass detection.
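The detection idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: the per-step scores, the drift and threshold values, and the example sequences below are all hypothetical. Each score stands in for a state-transition amplification term (e.g., the log-ratio of posterior to prior probability of a reasoning step); a one-sided CUSUM statistic accumulates persistent positive deviations and raises an alarm when it crosses a threshold.

```python
def cusum_detect(scores, drift=0.5, threshold=5.0):
    """One-sided CUSUM over per-step amplification scores.

    scores: hypothetical log(posterior / prior) values for each
            reasoning step; benign steps hover near zero, while
            an injected path shows persistently large values.
    Returns the index at which the alarm fires, or None.
    """
    s = 0.0
    for t, x in enumerate(scores):
        # Accumulate deviation beyond the allowed drift; clamp at zero
        # so isolated benign fluctuations do not trigger an alarm.
        s = max(0.0, s + x - drift)
        if s >= threshold:
            return t
    return None

# Illustrative (fabricated) score sequences:
benign = [0.1, -0.2, 0.3, 0.0, 0.2, 0.1]
attacked = [0.1, 0.2, 3.0, 2.8, 3.2, 2.9]  # persistent amplification

print(cusum_detect(benign))    # → None (no alarm)
print(cusum_detect(attacked))  # → 4 (alarm during the injected run)
```

Clamping the statistic at zero is what makes CUSUM sensitive to *persistent* anomalies rather than single outlier steps, which matches the abstract's emphasis on detecting sustained amplification along a malicious path.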