Large language models can be steered at inference time through prompting or through activation interventions, but activation steering methods often underperform prompt-based approaches. We propose a framework that formulates prompt steering as a form of activation steering, and we investigate whether distilling successful prompt steering behavior into simpler, interpretable models can close this gap. Our analysis reveals that popular activation steering methods are not faithful to the mechanics of prompt steering, which intervenes strongly on some tokens while barely affecting others. Based on these insights, we introduce Prompt Steering Replacement (PSR) models, which estimate token-specific steering coefficients from the activations themselves and are trained to imitate prompt-based interventions. Experiments on three steering benchmarks across multiple language models show that PSR models outperform existing activation steering methods, especially when controlling for high-coherence completions, and also compare favorably to prompting on AxBench and persona steering.
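To make the contrast between a uniform intervention and a token-specific one concrete, the following is a minimal sketch rather than the PSR implementation. It assumes a single steering direction `v`, a linear coefficient model `(w, b)` fitted by least squares, and synthetic activations; all function names, shapes, and values are hypothetical and chosen only for illustration.

```python
import numpy as np


def constant_steering(h, v, alpha):
    """Shift every token's activation by the same multiple of v."""
    return h + alpha * v


def token_specific_steering(h, v, w, b):
    """Shift each token by its own multiple of v, predicted from h.

    The linear coefficient model (w, b) is an assumption made for this
    sketch; the abstract does not specify the PSR architecture.
    """
    coeff = h @ w + b                      # one coefficient per token
    return h + coeff[:, None] * v, coeff


def fit_coefficients(h, target_delta, v):
    """Least-squares fit of (w, b) to imitate a target activation shift."""
    # Best per-token multiple of v for each row of target_delta.
    target_coeff = target_delta @ v / (v @ v)
    X = np.hstack([h, np.ones((h.shape[0], 1))])   # append a bias column
    sol, *_ = np.linalg.lstsq(X, target_coeff, rcond=None)
    return sol[:-1], sol[-1]


# Synthetic stand-in data, not taken from any experiment.
rng = np.random.default_rng(0)
seq_len, d_model = 6, 8
h = rng.normal(size=(seq_len, d_model))
v = rng.normal(size=d_model)
# Pretend a steering prompt shifts tokens 1 and 4 strongly, the rest barely.
prompt_coeff = np.array([0.0, 4.0, 0.1, 0.0, 3.0, 0.05])
target_delta = prompt_coeff[:, None] * v

w, b = fit_coefficients(h, target_delta, v)
steered, coeff = token_specific_steering(h, v, w, b)
uniform = constant_steering(h, v, alpha=prompt_coeff.mean())
```

In this toy setting the fitted coefficients reproduce the target shift exactly only because there are fewer tokens than model parameters; on real activations the fit would be approximate. The uniform variant applies the same average shift to every token, which is the kind of mismatch with prompt steering that the abstract describes.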