Activation steering methods enable inference-time control of large language model (LLM) behavior without retraining, but current approaches face a fundamental trade-off: sample-efficient methods capture steering signals from labeled examples only suboptimally, while methods that better extract these signals require hundreds to thousands of examples. We introduce COLD-Steer, a training-free framework that steers LLM activations by approximating the representational changes that would result from gradient descent on in-context examples. Our key insight is that the effect of fine-tuning on a small set of examples can be efficiently approximated at inference time without actual parameter updates. We formalize this through two complementary approaches: (i) a unit kernel approximation method that updates the activations directly using gradients with respect to them, normalized across examples, and (ii) a finite-difference approximation requiring only two forward passes regardless of example count. Experiments across a variety of steering tasks and benchmarks demonstrate that COLD-Steer achieves up to 95% steering effectiveness while using 50 times fewer samples than the best baseline. COLD-Steer accommodates diverse perspectives without extensive demonstration data, which we validate through experiments on pluralistic alignment tasks. Our framework opens new possibilities for adaptive, context-aware model control that can flexibly address varying loss-driven human preferences through principled approximation of learning dynamics rather than specialized training procedures.
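The finite-difference approach (ii) can be illustrated with a minimal sketch. The abstract does not specify the exact formulation, so everything below is a hypothetical toy: a frozen layer stands in for the model, and the "in-context examples" are folded into a perturbed prompt embedding. The point it shows is the two-forward-pass structure: one pass on the bare prompt, one on the example-conditioned prompt, with the activation difference used as a steering vector.

```python
import numpy as np

# Hedged sketch (not the paper's implementation): approximate the activation
# shift that conditioning on in-context examples would induce, using only two
# forward passes regardless of how many examples were pooled in.

rng = np.random.default_rng(0)
d_in, d_hid = 8, 16
W = rng.normal(size=(d_in, d_hid))  # stand-in for a frozen model layer

def hidden(x):
    """Hidden activations of one frozen 'layer' of the toy model."""
    return np.tanh(x @ W)

prompt = rng.normal(size=d_in)
# Hypothetical: a pooled embedding of the prompt with examples prepended.
# A real LLM would instead concatenate token sequences before the pass.
prompt_with_examples = prompt + 0.5 * rng.normal(size=d_in)

# Forward pass 1: bare prompt. Forward pass 2: example-conditioned prompt.
h_base = hidden(prompt)
h_ctx = hidden(prompt_with_examples)

# Finite-difference steering vector, scaled by a strength hyperparameter.
alpha = 1.0
steer = alpha * (h_ctx - h_base)

# Steered activations, applied at inference time with no parameter updates.
h_steered = h_base + steer
```

With `alpha = 1.0` the steered activations coincide with the example-conditioned ones; smaller values interpolate, letting steering strength be tuned without any additional forward passes.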