ODESteer: A Unified ODE-Based Steering Framework for LLM Alignment

Activation steering, or representation engineering, offers a lightweight approach to aligning large language models (LLMs) by manipulating their internal activations at inference time. However, current methods suffer from two key limitations: \textit{(i)} the lack of a unified theoretical framework for guiding the design of steering directions, and \textit{(ii)} an over-reliance on \textit{one-step steering}, which fails to capture complex patterns in activation distributions. In this work, we propose a unified \textit{theoretical} framework for activation steering in LLM alignment, based on ordinary differential equations (ODEs). We show that conventional activation addition can be interpreted as a first-order approximation to the solution of an ODE. From this ODE perspective, identifying a steering direction becomes equivalent to designing a \textit{barrier function} from control theory. Building on this framework, we introduce ODESteer, an ODE-based steering method guided by barrier functions, which delivers \textit{empirical} gains in LLM alignment. ODESteer identifies steering directions by defining the barrier function as the log-density ratio between positive and negative activations, and uses it to construct an ODE for \textit{multi-step and adaptive} steering. Compared to state-of-the-art activation steering methods, ODESteer achieves consistent empirical improvements on diverse LLM alignment benchmarks, including a notable $5.7\%$ improvement on TruthfulQA, $2.5\%$ on UltraFeedback, and $2.4\%$ on RealToxicityPrompts. Our work establishes a principled new view of activation steering in LLM alignment by unifying its theoretical foundations via ODEs and validating it empirically through the proposed ODESteer method.
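
To make the contrast between one-step and multi-step steering concrete, the following is a minimal sketch and not the paper's implementation. It assumes that the positive and negative activation densities are diagonal Gaussians, that the steering ODE is the gradient flow of the log-density ratio $h(a) = \log p_{+}(a) - \log p_{-}(a)$, and that the ODE is integrated with forward Euler. All function names and parameters here are hypothetical. Under these assumptions, a single Euler step of size $T$ reduces to plain activation addition, while several smaller steps re-evaluate the direction as the activation moves.

```python
import math

# Assumption: each density is a diagonal Gaussian given as (mean, variance),
# both plain lists of floats. The abstract does not specify the density model.

def log_density_ratio(a, pos, neg):
    """h(a) = log p_pos(a) - log p_neg(a), dropping the shared 2*pi constant."""
    (mu_p, var_p), (mu_n, var_n) = pos, neg
    log_p = -0.5 * sum((x - m) ** 2 / v + math.log(v)
                       for x, m, v in zip(a, mu_p, var_p))
    log_n = -0.5 * sum((x - m) ** 2 / v + math.log(v)
                       for x, m, v in zip(a, mu_n, var_n))
    return log_p - log_n

def barrier_grad(a, pos, neg):
    """Gradient of h with respect to the activation a."""
    (mu_p, var_p), (mu_n, var_n) = pos, neg
    return [-(x - mp) / vp + (x - mn) / vn
            for x, mp, vp, mn, vn in zip(a, mu_p, var_p, mu_n, var_n)]

def one_step_steer(a, pos, neg, T=1.0):
    """Activation addition: one step of size T along the gradient at a."""
    g = barrier_grad(a, pos, neg)
    return [x + T * gi for x, gi in zip(a, g)]

def ode_steer(a, pos, neg, T=1.0, n_steps=10):
    """Forward-Euler integration of da/dt = grad h(a) over [0, T]."""
    dt = T / n_steps
    a = list(a)
    for _ in range(n_steps):
        g = barrier_grad(a, pos, neg)  # direction is recomputed at every step
        a = [x + dt * gi for x, gi in zip(a, g)]
    return a

if __name__ == "__main__":
    # Toy two-dimensional "activations"; the numbers are illustrative only.
    pos = ([1.0, 1.0], [1.0, 1.0])
    neg = ([-1.0, -1.0], [2.0, 2.0])
    a0 = [-1.0, -1.0]
    print("one-step:  ", one_step_steer(a0, pos, neg))
    print("multi-step:", ode_steer(a0, pos, neg, n_steps=10))
```

With `n_steps=1` the two functions coincide, which mirrors the first-order-approximation claim. With more steps, the update direction adapts to the current activation instead of staying fixed at its initial value.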
