Activation steering offers parameter-efficient control over large language models (LLMs) at inference time, but many methods rely on off-distribution supervision and discrete masking, yielding brittle interventions. We propose ROAST (Rollout-based On-distribution Activation Steering Technique), which estimates steering directions from the model's own on-distribution rollouts via ROC and avoids hard sparsification through Continuous Soft Scaling (CSS) and Grouped Mean Normalization. Our empirical analysis shows that while activation magnitude correlates moderately with directional consistency, its variance is substantial and often disproportionate to semantic quality, so high-magnitude activations risk dominating the global steering direction unless properly normalized. ROAST therefore applies grouped normalization to balance contributions across samples, yielding a more robust estimate of the consensus steering direction. Across models from 0.6B to 32B parameters, ROAST consistently improves performance on diverse tasks (e.g., +9.7% on GSM8K for Qwen3-0.6B and +12.1% on TruthfulQA for GLM4-32B), and further analysis shows that CSS better preserves activation energy.
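To make the two ingredients concrete, the following is a minimal sketch, not the paper's actual implementation: the function names, the per-group L2-norm scaling, and the sigmoid relaxation are all illustrative assumptions about what grouped mean normalization and continuous soft scaling could look like.

```python
import numpy as np

def grouped_mean_normalization(acts, groups, eps=1e-8):
    """Illustrative sketch: rescale each sample's activation by its group's
    mean L2 norm, so high-magnitude samples cannot dominate the consensus
    steering direction.  acts: (n, d) rollout activations; groups: (n,) ids."""
    acts = np.asarray(acts, dtype=float)
    groups = np.asarray(groups)
    out = np.empty_like(acts)
    for g in np.unique(groups):
        idx = groups == g
        mean_norm = np.linalg.norm(acts[idx], axis=1).mean()
        out[idx] = acts[idx] / (mean_norm + eps)  # balance group contributions
    return out

def continuous_soft_scaling(direction, scores, temperature=1.0):
    """Illustrative sketch: instead of a hard top-k mask over coordinates,
    weight each coordinate by a sigmoid of a relevance score -- a continuous
    relaxation that keeps some energy in low-scoring coordinates."""
    weights = 1.0 / (1.0 + np.exp(-np.asarray(scores, dtype=float) / temperature))
    return np.asarray(direction, dtype=float) * weights
```

Under this sketch, a consensus steering direction would be the mean of the normalized activations, optionally passed through the soft scaling instead of a discrete mask.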