Contrastive steering has been shown to be a simple and effective method for adjusting the generative behavior of LLMs at inference time. It uses examples of prompt-response pairs with and without a trait to identify a direction in an intermediate activation layer, and then shifts activations along this one-dimensional subspace. However, despite its growing use in AI safety applications, the robustness of contrastive steering to noisy or adversarial data corruption is poorly understood. We initiate a study of the robustness of this process with respect to corruption of the dataset of examples used to learn the steering direction. Our first observation is that contrastive steering is quite robust to a moderate amount of corruption, but unwanted side effects can be clearly and maliciously induced when a non-trivial fraction of the training data is altered. Second, we analyze the geometry of various types of corruption and identify some safeguards. Notably, a key step in learning the steering direction involves a high-dimensional mean computation, and we show that replacing this step with a recently developed robust mean estimator often mitigates most of the unwanted effects of malicious corruption.
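To make the pipeline concrete, the following is a minimal sketch in Python with numpy. All function names, the steering strength `alpha`, and the toy corruption setup are illustrative assumptions, not taken from the paper; the coordinate-wise trimmed mean is a simple stand-in for the robust high-dimensional mean estimator the abstract refers to, not necessarily the specific estimator used there.

```python
import numpy as np

def steering_direction(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Difference-of-means steering direction from (n, d) activation matrices
    collected on examples with / without the trait."""
    v = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def trimmed_mean(acts: np.ndarray, trim_frac: float = 0.1) -> np.ndarray:
    """Coordinate-wise trimmed mean: drop the trim_frac most extreme values
    at each end of every coordinate before averaging. A simple robust
    stand-in, not the paper's specific estimator."""
    n = acts.shape[0]
    k = int(trim_frac * n)
    srt = np.sort(acts, axis=0)
    return srt[k:n - k].mean(axis=0)

def robust_steering_direction(pos_acts, neg_acts, trim_frac=0.1):
    """Same direction estimate, with the naive means replaced by robust ones."""
    v = trimmed_mean(pos_acts, trim_frac) - trimmed_mean(neg_acts, trim_frac)
    return v / np.linalg.norm(v)

def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float = 4.0):
    """Shift a hidden state along the 1-D steering subspace at inference time;
    alpha is an illustrative steering strength."""
    return hidden + alpha * direction

# Toy experiment (hypothetical): plant a ground-truth trait direction,
# corrupt a fraction of the positive examples, and compare the alignment
# of the naive vs. robust direction estimates with the planted direction.
rng = np.random.default_rng(0)
n, d = 200, 256
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)

neg = rng.normal(size=(n, d))
pos = rng.normal(size=(n, d)) + 2.0 * true_dir  # the trait shifts the mean

m = int(0.15 * n)  # adversarially replace 15% of the positive set
pos[:m] = rng.normal(size=(m, d)) + 25.0 * rng.normal(size=d)

print("naive alignment :", float(steering_direction(pos, neg) @ true_dir))
print("robust alignment:", float(robust_steering_direction(pos, neg, 0.2) @ true_dir))
```

On this toy setup the naive difference-of-means direction tends to be dominated by the injected outliers, while the trimmed-mean variant typically stays much closer to the planted direction, illustrating the kind of safeguard the abstract describes; the paper's actual estimator and corruption models may differ.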