White-box monitors are a popular technique for detecting potentially harmful behaviours in language models. While they perform well in general, their effectiveness at detecting behaviours that leave no explicit textual trace is disputed. In this work, we find evidence that removing textual evidence of a behaviour significantly decreases probe performance: the AUROC reduction ranges from $10$ to $30$ points, depending on the setting. We evaluate probe monitors across three setups (Sandbagging, Sycophancy, and Bias) and find that when probes rely on textual evidence of the target behaviour (such as system prompts or chain-of-thought reasoning), performance degrades once these tokens are filtered out. This filtering procedure is standard practice when evaluating output monitors. As further evidence of this phenomenon, we train Model Organisms that exhibit the target behaviours without verbalising them in their outputs. We validate that probe performance on these Model Organisms is substantially lower than in unfiltered evaluations: $0.57$ vs. $0.74$ AUROC for Bias, and $0.57$ vs. $0.94$ AUROC for Sandbagging. Our findings suggest that linear probes may be brittle in scenarios where they must detect non-surface-level patterns.
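To make the evaluation setup concrete, the following is a minimal sketch of a probe-monitor pipeline of the kind described above, not the paper's exact method. It assumes cached per-token activations `acts`, binary behaviour labels `labels`, and a boolean mask `is_evidence` flagging tokens that textually verbalise the behaviour; all of these names, the mean-pooling choice, and the synthetic data are hypothetical stand-ins.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def mean_pool(acts, mask):
    """Mean-pool token activations, keeping only tokens where mask is True.

    acts: (n_examples, n_tokens, d_model) cached activations
    mask: (n_examples, n_tokens) boolean; False drops the token
    """
    m = mask[..., None].astype(acts.dtype)
    return (acts * m).sum(axis=1) / np.maximum(m.sum(axis=1), 1.0)

# Hypothetical cached data: activations, behaviour labels, and a mask
# flagging tokens that explicitly verbalise the behaviour (e.g. system
# prompt or CoT tokens admitting to the behaviour).
rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 64, 128))        # stand-in for real activations
labels = rng.integers(0, 2, size=200)         # 1 = behaviour present
is_evidence = rng.random((200, 64)) < 0.1     # tokens carrying textual evidence

# Train a linear probe on unfiltered pooled representations.
X_all = mean_pool(acts, np.ones((200, 64), dtype=bool))
probe = LogisticRegression(max_iter=1000)
probe.fit(X_all[:100], labels[:100])

# Evaluate with and without the textual-evidence tokens. The gap between
# the two AUROCs corresponds to the degradation the abstract reports
# (here it will be near zero, since the data is random).
auroc_unfiltered = roc_auc_score(labels[100:], probe.predict_proba(X_all[100:])[:, 1])
X_filtered = mean_pool(acts[100:], ~is_evidence[100:])
auroc_filtered = roc_auc_score(labels[100:], probe.predict_proba(X_filtered)[:, 1])
print(f"AUROC unfiltered: {auroc_unfiltered:.2f}, filtered: {auroc_filtered:.2f}")
```

On real activations, comparing `auroc_unfiltered` against `auroc_filtered` isolates how much of the probe's signal comes from tokens that verbalise the behaviour, which is the comparison the reported $10$- to $30$-point gaps rest on.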