Warning-framed content in training data (e.g., "DO NOT USE - this code is vulnerable") does not, it turns out, teach language models to avoid the warned-against behavior. In experiments reported here, models exposed to such warnings reproduced the flagged content at rates statistically indistinguishable from models given the content directly (76.7% vs. 83.3%). Why? Sparse autoencoder analysis points to a failure of orthogonalization: "describing X" and "performing X" activate overlapping latent features. Feature #8684, which tracks code execution patterns, fires at comparable magnitude in both warning and exploitation contexts. A related phenomenon, which I call "stealth slip", lets conversational preambles rotate activations into subspaces that linear probes miss entirely. Prompting and inference-time steering do not fix this; training-time feature ablation does. The upshot is that statistical co-occurrence dominates pragmatic interpretation in current architectures. Models learn what tends to follow a context, not why it appeared there.
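To make the feature-overlap claim concrete, here is a minimal sketch of the kind of comparison involved: reading out one SAE feature's activation in a warning-framed context versus a direct-exploitation context. Everything in it is assumed for illustration, not taken from the experiments above: the hidden states are random placeholders standing in for a model's residual stream, the encoder weights stand in for a trained sparse autoencoder, and the dimensions are arbitrary. Only the feature index 8684 comes from the text.

```python
# Hypothetical sketch: compare one SAE feature's mean activation across
# a "warning" context and an "exploitation" context.
# Assumptions (not from the source): toy dimensions, random stand-in
# hidden states, and untrained stand-in SAE encoder weights.

import torch

torch.manual_seed(0)

D_MODEL = 768        # residual-stream width (assumed)
N_FEATURES = 16384   # SAE dictionary size (assumed)
FEATURE_IDX = 8684   # the code-execution feature discussed above

# Stand-ins for a trained SAE encoder; in practice these are learned weights.
W_enc = torch.randn(D_MODEL, N_FEATURES) / D_MODEL**0.5
b_enc = torch.zeros(N_FEATURES)

def sae_feature_activation(hidden: torch.Tensor, idx: int) -> torch.Tensor:
    """ReLU-encoded SAE feature activation, averaged over token positions."""
    acts = torch.relu(hidden @ W_enc + b_enc)   # [seq_len, n_features]
    return acts[:, idx].mean()

# In a real run these would be residual-stream activations captured while the
# model reads each prompt; here they are random placeholders of the same shape.
hidden_warning = torch.randn(32, D_MODEL)   # "DO NOT USE - vulnerable" framing
hidden_exploit = torch.randn(32, D_MODEL)   # direct exploitation framing

a_warn = sae_feature_activation(hidden_warning, FEATURE_IDX)
a_expl = sae_feature_activation(hidden_exploit, FEATURE_IDX)

print(f"feature {FEATURE_IDX}: warning={a_warn:.4f}  exploit={a_expl:.4f}")
# Comparable magnitudes in both contexts would be the "describing X" /
# "performing X" overlap described above.
```

With real model activations and a trained SAE in place of the placeholders, near-equal magnitudes for the same feature across the two framings is exactly the orthogonalization failure the paragraph describes.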