Linear activation steering is a powerful approach for eliciting the capabilities of large language models and specializing their behavior using limited labeled data. While effective, existing methods often apply a fixed steering strength to all tokens, which yields inconsistent steering quality across diverse input prompts. In this work, we introduce Contextual Linear Activation Steering (CLAS), a method that dynamically adapts the steering strength to the context instead of holding it fixed. Across eleven steering benchmarks and four model families, CLAS consistently outperforms standard linear activation steering and matches or exceeds the performance of ReFT and LoRA in settings with limited labeled data. We therefore propose CLAS as a scalable, interpretable, and accurate method for specializing and steering large language models.
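To make the contrast between a fixed and a context-dependent steering strength concrete, the following is a minimal NumPy sketch. It is not the CLAS implementation: the abstract does not say how the strengths are computed, so the softplus gate and the names `fixed_steering`, `contextual_steering`, `gate_w`, and `gate_b` are assumptions chosen purely for illustration.

```python
import numpy as np


def fixed_steering(hidden, direction, strength):
    """Add the same multiple of `direction` to every token's activation."""
    return hidden + strength * direction


def contextual_steering(hidden, direction, gate_w, gate_b):
    """Add a per-token multiple of `direction`.

    The per-token strength is softplus(hidden @ gate_w + gate_b). This gate is
    a hypothetical parameterization, not one taken from the source.
    """
    logits = hidden @ gate_w + gate_b          # shape: (tokens,)
    strengths = np.logaddexp(0.0, logits)      # softplus, strictly positive
    steered = hidden + strengths[:, None] * direction
    return steered, strengths


# Toy activations: 5 tokens, hidden size 8 (random stand-ins for a real model).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 8))
direction = rng.normal(size=8)
direction /= np.linalg.norm(direction)         # unit-norm steering direction
gate_w = rng.normal(size=8)                    # hypothetical gate weights
gate_b = 0.0                                   # hypothetical gate bias

fixed = fixed_steering(hidden, direction, strength=4.0)
ctx, strengths = contextual_steering(hidden, direction, gate_w, gate_b)
```

In the fixed variant every token moves the same distance along `direction`; in the contextual variant the distance is read off each token's own activation, so it can differ from token to token and from prompt to prompt.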