Layerwise Convergence Fingerprints for Runtime Misbehavior Detection in Large Language Models

Large language models deployed at runtime can misbehave in ways that clean-data validation cannot anticipate: training-time backdoors lie dormant until triggered, jailbreaks subvert safety alignment, and prompt injections override the deployer's instructions. Existing runtime defenses address these threats one at a time and often assume a clean reference model, trigger knowledge, or editable weights; these assumptions rarely hold for opaque third-party artifacts. We introduce Layerwise Convergence Fingerprinting (LCF), a tuning-free runtime monitor that treats the inter-layer hidden-state trajectory as a health signal. LCF computes a diagonal Mahalanobis distance on every inter-layer difference, aggregates via Ledoit-Wolf shrinkage, and thresholds via leave-one-out calibration on 200 clean examples; it requires no reference model, trigger knowledge, or retraining. We evaluate LCF on four architectures (Llama-3-8B, Qwen2.5-7B, Gemma-2-9B, Qwen2.5-14B) across backdoors, jailbreaks, and prompt injection (56 backdoor combinations, 3 jailbreak techniques, and the BIPIA email and code-QA domains). LCF reduces mean backdoor attack success rate (ASR) to below 1% on Qwen2.5-7B and Gemma-2-9B and to 1.3% on Qwen2.5-14B, detects 92-100% of DAN jailbreaks (62-100% for GCG and softer role-play), and flags 100% of text-payload injections across all eight (model, domain) cells, at a 12-16% backdoor false-positive rate (FPR) and <0.1% inference overhead. A single aggregation score covers all three threat families without threat-specific tuning, positioning LCF as a general-purpose runtime safety layer for cloud-served and on-device LLMs.
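To make the three steps of the pipeline concrete, the following is a minimal NumPy sketch of how such a monitor could be put together. It is an illustration under stated assumptions, not the authors' implementation. It assumes each input is summarized by one pooled hidden-state vector per layer. It reads "aggregation" as a Mahalanobis form over the per-layer scores. It replaces Ledoit-Wolf's data-driven shrinkage intensity with a fixed, hypothetical `alpha` (scikit-learn's `sklearn.covariance.LedoitWolf` would estimate that intensity from the data). It reads "leave-one-out calibration" as scoring each clean example against statistics fitted on the remaining ones. The function names, the `0.95` quantile, and the synthetic data are all illustrative.

```python
import numpy as np

def fit_fingerprint(H, alpha=0.1, eps=1e-6):
    """Fit clean-data statistics. H has shape (n_examples, n_layers + 1, d)."""
    D = np.diff(H, axis=1)                        # inter-layer differences, (n, L, d)
    mu = D.mean(axis=0)                           # per-layer, per-dimension mean
    var = D.var(axis=0) + eps                     # per-layer, per-dimension variance
    # Diagonal Mahalanobis distance of every inter-layer difference: (n, L)
    S = np.sqrt((((D - mu) ** 2) / var).sum(axis=2))
    cov = np.cov(S, rowvar=False)                 # (L, L) covariance of layer scores
    n_layers = cov.shape[0]
    # Shrink toward a scaled identity. ASSUMPTION: fixed alpha stands in for
    # the Ledoit-Wolf estimate of the shrinkage intensity.
    target = (np.trace(cov) / n_layers) * np.eye(n_layers)
    shrunk = (1.0 - alpha) * cov + alpha * target
    return {"mu": mu, "var": var, "s_mean": S.mean(axis=0),
            "precision": np.linalg.inv(shrunk)}

def lcf_score(h, fp):
    """Aggregate score for one example. h has shape (n_layers + 1, d)."""
    delta = np.diff(h, axis=0)                    # (L, d)
    s = np.sqrt((((delta - fp["mu"]) ** 2) / fp["var"]).sum(axis=1))
    c = s - fp["s_mean"]
    return float(c @ fp["precision"] @ c)

def loo_threshold(H, quantile=0.95, alpha=0.1):
    """Score each clean example against a fit on all the others (assumed
    reading of leave-one-out calibration), then take a quantile."""
    held_out = [lcf_score(H[i], fit_fingerprint(np.delete(H, i, axis=0), alpha))
                for i in range(len(H))]
    return float(np.quantile(held_out, quantile))

# Synthetic stand-in for 200 clean examples, 4 inter-layer steps, d = 16.
rng = np.random.default_rng(0)
H_clean = rng.normal(size=(200, 5, 16))
threshold = loo_threshold(H_clean)
fingerprint = fit_fingerprint(H_clean)
is_flagged = lcf_score(H_clean[0], fingerprint) > threshold
```

At inference time only `lcf_score` runs, which is a handful of vector operations per layer. The leave-one-out loop is a one-time calibration cost.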
