Activation steering methods enable inference-time control of large language model (LLM) behavior without retraining, but current approaches face a fundamental trade-off: sample-efficient methods capture steering signals from labeled examples only suboptimally, while methods that better extract these signals require hundreds to thousands of examples. We introduce COLD-Steer, a training-free framework that steers LLM activations by approximating the representational changes that would result from gradient descent on in-context examples. Our key insight is that the effect of fine-tuning on a small set of examples can be efficiently approximated at inference time without actual parameter updates. We formalize this through two complementary approaches: (i) a unit kernel approximation method that updates the activations directly using gradients with respect to them, normalized across examples, and (ii) a finite-difference approximation requiring only two forward passes regardless of example count. Experiments across a variety of steering tasks and benchmarks demonstrate that COLD-Steer achieves up to 95% steering effectiveness while using 50 times fewer samples than the best baseline. COLD-Steer accommodates diverse perspectives without extensive demonstration data, which we validate through experiments on pluralistic alignment tasks. Our framework opens new possibilities for adaptive, context-aware model control that can flexibly address varying loss-driven human preferences through principled approximation of learning dynamics rather than specialized training procedures.
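The finite-difference approach (ii) can be illustrated with a minimal sketch. The abstract does not specify the exact formulation, so everything below is a hypothetical toy: a frozen layer stands in for the model, and the "in-context examples" are folded into a perturbed prompt embedding. The point it shows is the two-forward-pass structure: one pass on the bare prompt, one on the example-conditioned prompt, with the activation difference used as a steering vector.

```python
import numpy as np

# Hedged sketch (not the paper's implementation): approximate the activation
# shift that conditioning on in-context examples would induce, using only two
# forward passes regardless of how many examples were pooled in.

rng = np.random.default_rng(0)
d_in, d_hid = 8, 16
W = rng.normal(size=(d_in, d_hid))  # stand-in for a frozen model layer

def hidden(x):
    """Hidden activations of one frozen 'layer' of the toy model."""
    return np.tanh(x @ W)

prompt = rng.normal(size=d_in)
# Hypothetical: a pooled embedding of the prompt with examples prepended.
# A real LLM would instead concatenate token sequences before the pass.
prompt_with_examples = prompt + 0.5 * rng.normal(size=d_in)

# Forward pass 1: bare prompt. Forward pass 2: example-conditioned prompt.
h_base = hidden(prompt)
h_ctx = hidden(prompt_with_examples)

# Finite-difference steering vector, scaled by a strength hyperparameter.
alpha = 1.0
steer = alpha * (h_ctx - h_base)

# Steered activations, applied at inference time with no parameter updates.
h_steered = h_base + steer
```

With `alpha = 1.0` the steered activations coincide with the example-conditioned ones; smaller values interpolate, letting steering strength be tuned without any additional forward passes.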