Probes trained on model activations can detect undesirable behaviors, such as deception or bias, that are difficult to identify from outputs alone. This makes them useful detectors of misbehavior. They are also valuable training signals, since they reward not only good outputs but also good internal processes for arriving at those outputs. However, training against interpretability tools raises a fundamental concern: when a monitor becomes a training target, it may cease to be reliable (Goodhart's Law). We propose two methods for training against probes, based on Supervised Fine-tuning (SFT) and Direct Preference Optimization (DPO). We conduct an initial exploration of these methods in a toxicity-reduction testbed and measure how much probe accuracy drops when training against them. To retain the accuracy of probe detectors after training, we attempt (1) training against an ensemble of probes, (2) retaining held-out probes that are not used for training, and (3) retraining new probes after training. We report two main findings. First, probe-based preference optimization unexpectedly preserves probe detectability better than classifier-based methods, suggesting that the preference-learning objective incentivizes maintaining rather than obfuscating the relevant representations. Second, probe diversity provides minimal practical benefit: simply retraining probes after optimization recovers high detection accuracy. Our findings suggest that probe-based training can be viable for certain alignment methods, though probe ensembles are largely unnecessary when retraining is feasible.
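To make the setup concrete, below is a minimal illustrative sketch, not the actual implementation, of the two ingredients the abstract refers to: fitting a linear probe on cached activations to flag toxic text, and turning the probe's output into a differentiable penalty that an SFT or DPO loss could train against. All names, shapes, and hyperparameters here (e.g. `LinearProbe`, `probe_penalty`, `D_MODEL`) are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Assumed setup: activations are cached residual-stream vectors from some
# layer of the policy model; labels mark toxic (1) vs. benign (0) text.
D_MODEL = 768  # hypothetical hidden size

class LinearProbe(nn.Module):
    """Logistic-regression probe over a single activation vector."""
    def __init__(self, d_model: int):
        super().__init__()
        self.linear = nn.Linear(d_model, 1)

    def forward(self, acts: torch.Tensor) -> torch.Tensor:
        # Returns the logit of P(toxic) for each activation vector.
        return self.linear(acts).squeeze(-1)

def train_probe(acts: torch.Tensor, labels: torch.Tensor,
                epochs: int = 100, lr: float = 1e-2) -> LinearProbe:
    """Fit one probe with binary cross-entropy. An ensemble, as in (1),
    would be several probes trained from different random seeds."""
    probe = LinearProbe(acts.shape[-1])
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(acts), labels)
        loss.backward()
        opt.step()
    return probe

def probe_penalty(probe: LinearProbe, acts: torch.Tensor) -> torch.Tensor:
    """Differentiable penalty to add to an SFT or DPO loss, so the policy
    is rewarded for activations the probe classifies as non-toxic."""
    return torch.sigmoid(probe(acts)).mean()

if __name__ == "__main__":
    # Toy stand-in data; in practice acts come from the policy model.
    acts = torch.randn(256, D_MODEL)
    labels = (torch.rand(256) > 0.5).float()
    probe = train_probe(acts, labels)
    print("penalty on toy batch:", probe_penalty(probe, acts).item())
```

Under this framing, held-out probes (2) are probes trained the same way but excluded from the penalty term, and retraining (3) amounts to calling `train_probe` again on activations from the fine-tuned model.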