Activation steering methods control large language model (LLM) behavior by modifying internal activations at inference time. However, most existing activation steering methods rely on a fixed steering strength, which leads either to insufficient control or to overly aggressive intervention that degrades text plausibility and coherence. We introduce In-Distribution Steering (IDS), a novel method that adapts steering strength based on the input data distribution in representation space. IDS dynamically adjusts the intervention according to how far a given input deviates from the distribution, enabling adaptive control and stable text generation. Experiments demonstrate that IDS achieves strong accuracy on classification tasks while producing coherent, collapse-free text, making IDS particularly well suited for real-world applications.
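The core idea of distribution-aware steering can be illustrated with a minimal sketch. The code below is an illustrative assumption, not the paper's exact formulation: it estimates the mean and per-dimension spread of reference in-distribution activations, then scales a steering vector down as the current hidden state drifts away from that reference. All function names and the specific scaling rule are hypothetical.

```python
import numpy as np

def fit_reference(activations: np.ndarray):
    """Estimate mean and per-dimension std of in-distribution activations.
    (Illustrative: the actual method may use a richer distribution model.)"""
    mu = activations.mean(axis=0)
    sigma = activations.std(axis=0) + 1e-6  # avoid division by zero
    return mu, sigma

def adaptive_steer(h, steering_dir, mu, sigma, base_strength=4.0):
    """Add a steering vector whose strength shrinks as the activation
    moves out of distribution (normalized-distance heuristic)."""
    # Dimension-normalized distance of h from the reference distribution.
    z = np.linalg.norm((h - mu) / sigma) / np.sqrt(h.shape[-1])
    strength = base_strength / (1.0 + z)  # weaker push when far out
    return h + strength * steering_dir

rng = np.random.default_rng(0)
ref = rng.normal(size=(256, 64))        # reference in-distribution activations
mu, sigma = fit_reference(ref)
h = rng.normal(size=64)                 # current hidden state
direction = rng.normal(size=64)
direction /= np.linalg.norm(direction)  # unit steering direction
h_steered = adaptive_steer(h, direction, mu, sigma)
```

In this sketch, an activation far outside the reference distribution receives a much smaller intervention than one near its center, which is one plausible way to trade control strength against generation stability.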