Warning-framed content in training data (e.g., "DO NOT USE - this code is vulnerable") does not, it turns out, teach language models to avoid the warned-against behavior. In experiments reported here, models exposed to such warnings reproduced the flagged content at rates statistically indistinguishable from models given the content directly (76.7% vs. 83.3%). Why? Sparse autoencoder analysis points to a failure of orthogonalization: "describing X" and "performing X" activate overlapping latent features. Feature #8684, which tracks code execution patterns, fires at comparable magnitude in both warning and exploitation contexts. A related phenomenon, which I call "stealth slip", lets conversational preambles rotate activations into subspaces that linear probes miss entirely. Prompting and inference-time steering do not fix this; training-time feature ablation does. The upshot is that statistical co-occurrence dominates pragmatic interpretation in current architectures. Models learn what tends to follow a context, not why it appeared there.
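To make the feature-overlap claim concrete, here is a minimal sketch of the kind of comparison involved: reading out one SAE feature's activation in a warning-framed context versus a direct-exploitation context. Everything in it is assumed for illustration, not taken from the experiments above: the hidden states are random placeholders standing in for a model's residual stream, the encoder weights stand in for a trained sparse autoencoder, and the dimensions are arbitrary. Only the feature index 8684 comes from the text.

```python
# Hypothetical sketch: compare one SAE feature's mean activation across
# a "warning" context and an "exploitation" context.
# Assumptions (not from the source): toy dimensions, random stand-in
# hidden states, and untrained stand-in SAE encoder weights.

import torch

torch.manual_seed(0)

D_MODEL = 768        # residual-stream width (assumed)
N_FEATURES = 16384   # SAE dictionary size (assumed)
FEATURE_IDX = 8684   # the code-execution feature discussed above

# Stand-ins for a trained SAE encoder; in practice these are learned weights.
W_enc = torch.randn(D_MODEL, N_FEATURES) / D_MODEL**0.5
b_enc = torch.zeros(N_FEATURES)

def sae_feature_activation(hidden: torch.Tensor, idx: int) -> torch.Tensor:
    """ReLU-encoded SAE feature activation, averaged over token positions."""
    acts = torch.relu(hidden @ W_enc + b_enc)   # [seq_len, n_features]
    return acts[:, idx].mean()

# In a real run these would be residual-stream activations captured while the
# model reads each prompt; here they are random placeholders of the same shape.
hidden_warning = torch.randn(32, D_MODEL)   # "DO NOT USE - vulnerable" framing
hidden_exploit = torch.randn(32, D_MODEL)   # direct exploitation framing

a_warn = sae_feature_activation(hidden_warning, FEATURE_IDX)
a_expl = sae_feature_activation(hidden_exploit, FEATURE_IDX)

print(f"feature {FEATURE_IDX}: warning={a_warn:.4f}  exploit={a_expl:.4f}")
# Comparable magnitudes in both contexts would be the "describing X" /
# "performing X" overlap described above.
```

With real model activations and a trained SAE in place of the placeholders, near-equal magnitudes for the same feature across the two framings is exactly the orthogonalization failure the paragraph describes.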