When is Your LLM Steerable?

Activation steering offers a lightweight approach to controlling language models' behavior at inference time, but whether it succeeds or fails depends heavily on the prompt, concept, model, and steering configuration. Finding the regime and boundaries of successful steering typically requires expensive grid searches and post-hoc evaluation of full autoregressive rollouts. In this work, we investigate whether steerability can be predicted from the model's internal states at the beginning of the generation process, e.g., after generating the first few tokens, and how to leverage such a predictor to improve the steering success rate. To this end, we first introduce ASTEER, a testbed of 1.4M steered generations spanning 150 concepts, each labeled as a steering success or failure. Leveraging this testbed, we analyze the model's early decoding dynamics by extracting features that compare hidden states before and after steering across layers and initial decoding steps. These features show how the effects of steering propagate across layers and token positions, which provides key information for steerability prediction. We then train a gradient-boosted decision tree (GBDT) classifier on these features to predict whether an intervention will under-steer, succeed, or over-steer, without requiring a full rollout. Our predictor achieves a macro-F1 score of around 0.7 on unseen concepts, demonstrating that early hidden states encode substantial, structured information about eventual steering efficacy. We further use this steerability predictor to guide the search over steering strength, achieving near-optimal performance at a small fraction of the decoding cost.
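To make the pipeline described above more concrete, the following is a minimal sketch under stated assumptions, not the method used in this work. It illustrates two ideas from the abstract: features that compare hidden states before and after steering across layers and early decoding steps, and a search over steering strength guided by a three-way predictor (under-steer, success, over-steer). The specific features (cosine similarity and relative shift norm), the bisection strategy, the assumption that labels move monotonically from under-steer to over-steer as strength increases, and all names (`shift_features`, `search_strength`, `toy_predictor`) are hypothetical and chosen only for illustration. In practice the predictor would be the trained GBDT classifier applied to such features; here a toy stand-in is used.

```python
import math
from typing import Callable, List, Optional, Sequence

Vector = Sequence[float]

UNDER, SUCCESS, OVER = "under", "success", "over"


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def shift_features(
    base: Sequence[Sequence[Vector]],
    steered: Sequence[Sequence[Vector]],
) -> List[float]:
    """Compare unsteered and steered hidden states.

    base[layer][step] and steered[layer][step] are hidden-state vectors.
    For each (layer, step) pair, emit two illustrative features:
    the cosine similarity and the shift norm relative to the base norm.
    """
    feats: List[float] = []
    for layer_base, layer_steered in zip(base, steered):
        for h0, h1 in zip(layer_base, layer_steered):
            diff_norm = math.sqrt(sum((b - a) ** 2 for a, b in zip(h0, h1)))
            base_norm = math.sqrt(sum(a * a for a in h0))
            feats.append(cosine(h0, h1))
            feats.append(diff_norm / base_norm if base_norm > 0 else 0.0)
    return feats


def search_strength(
    predict: Callable[[float], str],
    lo: float,
    hi: float,
    max_steps: int = 8,
) -> Optional[float]:
    """Bisection over steering strength, guided by a three-way predictor.

    Assumes (hypothetically) that predictions go from UNDER to SUCCESS to
    OVER as strength grows. Returns the first strength predicted to
    succeed, or None if none is found within max_steps.
    """
    for _ in range(max_steps):
        mid = (lo + hi) / 2
        label = predict(mid)
        if label == SUCCESS:
            return mid
        if label == UNDER:
            lo = mid  # too weak: search stronger interventions
        else:
            hi = mid  # too strong: search weaker interventions
    return None


def toy_predictor(strength: float) -> str:
    """Stand-in for a trained classifier; the thresholds are arbitrary."""
    if strength < 3.0:
        return UNDER
    if strength > 5.0:
        return OVER
    return SUCCESS


if __name__ == "__main__":
    base = [[[1.0, 0.0]]]      # one layer, one decoding step
    steered = [[[0.0, 2.0]]]
    print(shift_features(base, steered))
    print(search_strength(toy_predictor, 0.0, 16.0))
```

Each call to `predict` in this sketch would require only a few decoding steps to collect hidden states, rather than a full rollout, which is where the cost saving described above would come from.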
