Faithful Bi-Directional Model Steering via Distribution Matching and Distributed Interchange Interventions

Intervention-based model steering offers a lightweight and interpretable alternative to prompting and fine-tuning. However, because current methods adapt strong optimization objectives from fine-tuning, they are susceptible to overfitting, often underperform, and sometimes generate unnatural outputs. We hypothesize that this is because effective steering requires faithfully identifying internal model mechanisms rather than enforcing external preferences. To this end, we build on the principles of distributed alignment search (DAS), the standard for causal variable localization, to propose a new steering method: Concept DAS (CDAS). We adopt the core mechanism of DAS, the distributed interchange intervention (DII), and introduce a novel distribution matching objective tailored to the steering task, which aligns intervened output distributions with counterfactual distributions. CDAS differs from prior work in two main ways. First, it learns interventions via weakly supervised distribution matching rather than probability maximization. Second, it uses DIIs, which naturally enable bi-directional steering and allow steering factors to be derived from data; this reduces the effort required for hyperparameter tuning and results in more faithful and stable control. On AxBench, a large-scale model steering benchmark, we show that CDAS does not always outperform preference-optimization methods but may benefit more from increased model scale. In two safety-related case studies, overriding the refusal behavior of safety-aligned models and neutralizing a chain-of-thought backdoor, CDAS achieves systematic steering while maintaining general model utility. These results indicate that CDAS is complementary to preference-optimization approaches and, under the right conditions, is a robust approach to intervention-based model steering. Our code is available at https://github.com/colored-dye/concept_das.
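To make the two ingredients named above concrete, the following is a minimal toy sketch, not the released implementation. It assumes a rank-1 intervention subspace, a hypothetical linear readout matrix `readout` standing in for the rest of the model, KL divergence as the matching criterion, and a counterfactual distribution given by the toy model's output on the source input. None of these choices is stated in the abstract, and every function name is illustrative.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def interchange(h_base, h_source, direction):
    """Rank-1 distributed interchange intervention (toy version).

    Replaces the component of h_base along `direction` with the
    component of h_source along that direction. The part of h_base
    orthogonal to `direction` is left untouched.
    """
    norm = math.sqrt(dot(direction, direction))
    w = [d / norm for d in direction]           # unit-norm direction
    delta = dot(w, h_source) - dot(w, h_base)   # shift along the direction
    return [hb + delta * wi for hb, wi in zip(h_base, w)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions (assumed criterion)."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))


def output_distribution(h, readout):
    """Toy 'model head': a hypothetical linear readout followed by softmax."""
    return softmax([dot(row, h) for row in readout])


def matching_loss(h_base, h_source, direction, readout):
    """Distribution-matching loss for one (base, source) pair.

    Compares the output distribution after the intervention with the
    counterfactual distribution, assumed here to be the toy model's
    output on the source representation.
    """
    intervened = output_distribution(
        interchange(h_base, h_source, direction), readout)
    counterfactual = output_distribution(h_source, readout)
    return kl_divergence(counterfactual, intervened)
```

In this toy setting, a direction that captures the feature the readout depends on drives the loss to zero, while an unrelated direction leaves it positive. Swapping the roles of `h_base` and `h_source` steers in the opposite direction with the same learned `direction`, which is the sense in which an interchange intervention is bi-directional.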
