Inference-time LLM alignment methods, particularly activation steering, offer an alternative to fine-tuning by directly modifying activations during generation. Existing methods, however, often rely on non-anticipative interventions that ignore how perturbations propagate through transformer layers and lack online error feedback, resulting in suboptimal, open-loop control. To address this, we show empirically that, despite the nonlinear structure of transformer blocks, layer-wise dynamics across multiple LLM architectures and scales are well approximated by locally linear models. Exploiting this property, we model LLM inference as a linear time-varying dynamical system and adapt the classical linear quadratic regulator to compute feedback controllers from layer-wise Jacobians, steering activations toward desired semantic setpoints in closed loop with minimal computational overhead and no offline training. We also derive theoretical bounds on setpoint tracking error, enabling formal guarantees on steering performance. Using a novel adaptive semantic feature setpoint signal, our method yields robust, fine-grained behavior control across models, scales, and tasks, including state-of-the-art modulation of toxicity, truthfulness, refusal, and arbitrary concepts, surpassing baseline steering methods. Our code is available at: https://github.com/trustworthyrobotics/lqr-activation-steering
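To make the control formulation concrete, the following is a minimal sketch of a finite-horizon, discrete-time linear quadratic regulator for a linear time-varying system, computed by the standard backward Riccati recursion. It is not the authors' implementation (see the repository linked above for that). Several things here are assumptions made for illustration: the function names are hypothetical, random matrices stand in for the layer-wise Jacobians, the input matrix is the identity (purely additive steering), and the state is the error between the activation and a fixed setpoint, with any affine drift from linearization ignored.

```python
import numpy as np


def ltv_lqr_gains(A_list, B_list, Q, R, Q_final):
    """Backward Riccati recursion for e_{k+1} = A_k e_k + B_k u_k.

    Returns the per-layer feedback gains K_k (control law u_k = -K_k e_k)
    and the cost-to-go matrix P_0 at the first layer.
    """
    P = Q_final
    gains = [None] * len(A_list)
    for k in reversed(range(len(A_list))):
        A, B = A_list[k], B_list[k]
        # K_k = (R + B^T P_{k+1} B)^{-1} B^T P_{k+1} A
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        # P_k = Q + A^T P_{k+1} (A - B K_k)
        P = Q + A.T @ P @ (A - B @ K)
        P = 0.5 * (P + P.T)  # remove floating-point asymmetry
        gains[k] = K
    return gains, P


def rollout(A_list, B_list, e0, gains=None):
    """Propagate the error through all layers; gains=None means no steering."""
    e = e0.copy()
    errors, controls = [e], []
    for k, (A, B) in enumerate(zip(A_list, B_list)):
        u = -gains[k] @ e if gains is not None else np.zeros(B.shape[1])
        e = A @ e + B @ u
        errors.append(e)
        controls.append(u)
    return errors, controls


def quadratic_cost(errors, controls, Q, R, Q_final):
    """Sum of stage costs e^T Q e + u^T R u plus the terminal cost."""
    stage = sum(e @ Q @ e + u @ R @ u for e, u in zip(errors[:-1], controls))
    return stage + errors[-1] @ Q_final @ errors[-1]


# Toy problem: random matrices stand in for layer-wise Jacobians (assumption).
rng = np.random.default_rng(0)
n_layers, dim = 6, 4
A_list = [np.eye(dim) + 0.3 * rng.standard_normal((dim, dim)) for _ in range(n_layers)]
B_list = [np.eye(dim) for _ in range(n_layers)]  # additive steering (assumption)
Q, R, Q_final = np.eye(dim), 0.1 * np.eye(dim), 10.0 * np.eye(dim)
e0 = rng.standard_normal(dim)  # initial deviation from the setpoint

gains, P0 = ltv_lqr_gains(A_list, B_list, Q, R, Q_final)
closed_errors, closed_controls = rollout(A_list, B_list, e0, gains)
open_errors, open_controls = rollout(A_list, B_list, e0)
closed_cost = quadratic_cost(closed_errors, closed_controls, Q, R, Q_final)
open_cost = quadratic_cost(open_errors, open_controls, Q, R, Q_final)
```

In this sketch the gains are computed backward from the last layer, so each intervention anticipates how it will propagate through later layers, and the control at each layer depends on the currently observed error rather than a fixed offset. For an exactly linear system the closed-loop cost equals `e0 @ P0 @ e0` and can never exceed the cost of applying no steering; for a real transformer the linear model is only a local approximation, so these properties would hold only approximately.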