Steering methods influence large language model (LLM) behavior by identifying semantic directions in hidden representations, but they are typically realized as inference-time activation interventions that apply a fixed, global modification to the model's internal states. While effective, such interventions often induce unfavorable attribute-utility trade-offs under strong control, as they ignore the fact that many behaviors are governed by a small and heterogeneous subset of model components. We propose Steer2Edit, a theoretically grounded, training-free framework that transforms steering vectors from inference-time control signals into diagnostic signals for component-level rank-1 weight editing. Instead of uniformly injecting a steering direction during generation, Steer2Edit selectively redistributes behavioral influence across individual attention heads and MLP neurons, yielding interpretable edits that preserve the standard forward pass and remain compatible with optimized parallel inference. Across safety alignment, hallucination mitigation, and reasoning efficiency, Steer2Edit consistently achieves more favorable attribute-utility trade-offs: at matched downstream performance, it improves safety by up to 17.2%, increases truthfulness by 9.8%, and reduces reasoning length by 12.2% on average. Overall, Steer2Edit provides a principled bridge between representation steering and weight editing by translating steering signals into interpretable, training-free parameter updates.
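The abstract does not spell out the component-scoring or edit rule, so the following is only a minimal, hypothetical sketch of the general idea of turning a steering vector into a rank-1 weight edit. The function name `rank1_edit_mlp`, the cosine-similarity score, the top-k neuron selection, and the hyperparameters `k` and `alpha` are all illustrative assumptions, not the paper's actual method.

```python
# Hypothetical sketch: use a steering direction as a diagnostic signal to pick
# MLP output neurons, then apply an additive rank-1 edit to the output projection.
# Assumed layout: W_out is [d_model, d_ff]; v is a [d_model] steering direction.
import torch

def rank1_edit_mlp(W_out: torch.Tensor, v: torch.Tensor, k: int = 32, alpha: float = 0.1) -> torch.Tensor:
    v = v / v.norm()
    # Score each MLP neuron (column of W_out) by its alignment with the steering direction.
    cols = W_out / W_out.norm(dim=0, keepdim=True).clamp_min(1e-8)   # [d_model, d_ff]
    scores = cols.T @ v                                              # [d_ff]
    top = torch.topk(scores.abs(), k).indices                        # most behavior-relevant neurons
    # Restrict the edit to the selected neurons; sign decides toward/away from v.
    mask = torch.zeros(W_out.shape[1], dtype=W_out.dtype, device=W_out.device)
    mask[top] = scores[top].sign()
    # Rank-1 update: W_out + alpha * v mask^T (nonzero only on the selected columns).
    return W_out + alpha * torch.outer(v, mask)

# Usage (illustrative): W_edited = rank1_edit_mlp(mlp_down_proj_weight, steering_vector)
```

Any actual redistribution of behavioral influence across attention heads and MLP neurons would follow the criteria defined in the paper; this sketch only illustrates the shape of a training-free, component-level rank-1 edit driven by a steering vector.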