Activation steering, or representation engineering, offers a lightweight approach to aligning large language models (LLMs) by manipulating their internal activations at inference time. However, current methods suffer from two key limitations: \textit{(i)} the lack of a unified theoretical framework to guide the design of steering directions, and \textit{(ii)} an over-reliance on \textit{one-step steering}, which fails to capture complex patterns in activation distributions. In this work, we propose a unified \textit{theoretical} framework, based on ordinary differential equations (ODEs), for activation steering in LLM alignment. We show that conventional activation addition can be interpreted as a first-order approximation to the solution of an ODE. Under this ODE perspective, identifying a steering direction becomes equivalent to designing a \textit{barrier function} from control theory. Building on this framework, we introduce ODESteer, an ODE-based steering method guided by barrier functions, which delivers \textit{empirical} gains in LLM alignment. ODESteer identifies steering directions by defining the barrier function as the log-density ratio between positive and negative activations, and uses it to construct an ODE for \textit{multi-step and adaptive} steering. Compared to state-of-the-art activation steering methods, ODESteer achieves consistent empirical improvements on diverse LLM alignment benchmarks, including a notable $5.7\%$ improvement on TruthfulQA, $2.5\%$ on UltraFeedback, and $2.4\%$ on RealToxicityPrompts. Our work establishes a principled new view of activation steering in LLM alignment by unifying its theoretical foundations via ODEs and validating it empirically through the proposed ODESteer method.
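
To make the first-order interpretation concrete, the following is a minimal sketch under assumed notation that does not appear in the abstract itself: $h(t)$ denotes an activation evolving over a fictitious steering time $t \in [0, T]$, $v$ is a hypothetical vector field, $K$ is a number of integration steps, and $p_{+}$, $p_{-}$ are the densities of positive and negative activations. Suppose steering is modeled as
\[
  \frac{\mathrm{d}h(t)}{\mathrm{d}t} = v\bigl(h(t)\bigr), \qquad h(0) = h_0 .
\]
A single forward Euler step of size $T$ gives
\[
  h(T) \approx h_0 + T\, v(h_0),
\]
which has the form of activation addition with steering vector $v(h_0)$ and strength $T$. Splitting the same interval into $K$ steps instead gives a multi-step update in which the direction is re-evaluated at every step:
\[
  h_{k+1} = h_k + \frac{T}{K}\, v(h_k), \qquad k = 0, \dots, K-1 .
\]
With the barrier function defined as the log-density ratio, $B(h) = \log p_{+}(h) - \log p_{-}(h)$, one plausible choice (an assumption made here for illustration, not necessarily the exact construction used by ODESteer) is $v(h) = \nabla_h B(h)$, so that each step moves the activation toward regions where positive activations are relatively more likely.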