Large language models can be fine-tuned to encode prompt-borne secrets into fluent, seemingly benign outputs. This creates a steganographic exfiltration risk that is difficult to detect with output-level steganalysis. Recent work proposes mechanistic detection using linear probes that recover the secret from internal activations. We show that this defense can be systematically evaded, but that detectability can be recovered through a targeted data-level intervention. First, we extend the detection setup to include a non-linear MLP probe. We then adversarially fine-tune steganographic trojans across five base models: Qwen3-8B, Llama-3.1-8B, Ministral-8B, Qwen3-14B, and Phi-4-14B. The resulting models retain $58$--$79\%$ exact-match secret recovery while evading both ridge and held-out MLP probes, with $1$--$8\%$ average capability degradation across six benchmarks. We then give an information-theoretic characterization of this evasion. Successful evasion preserves recoverability while reducing low-order extractability of the secret from the content-aligned representation, forcing the payload into synergistic interaction with residual degrees of freedom. This motivates a recontextualization dataset that restricts these residual degrees of freedom. On this distribution, both ridge and MLP detectability are restored across all five evasive trojans. Overall, our findings show that activation-based steganography detection is vulnerable to adaptive evasion, but also that theory-guided evaluation distributions can expose otherwise hidden payloads.