The increasing prevalence of large language models (LMs) in critical applications highlights the need for controlled language generation strategies that are not only computationally efficient but also come with performance guarantees. To achieve this, we adopt a common model in which concept semantics are linearly represented in an LM's latent space. In particular, we take the view that natural language generation traces a trajectory in this continuous semantic space, realized by the language model's hidden activations. This view permits a control-theoretic treatment of text generation in latent space, within which we propose a lightweight, gradient-free intervention that dynamically steers trajectories away from regions corresponding to undesired meanings. Crucially, we show that this intervention, which we compute in closed form, is guaranteed (in probability) to steer the output into the allowed region. Finally, we demonstrate on a toxicity avoidance objective that the intervention steers language away from undesired content while maintaining text quality.
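To make the idea of a closed-form, gradient-free intervention concrete, the following is a minimal sketch and not the method derived in the paper. It assumes, purely for illustration, that the undesired region is a half-space {h : w·h > b} defined by a hypothetical concept direction `w` and threshold `b`, and it applies the smallest Euclidean shift that moves a hidden state `h` back into the allowed region. The names `steer`, `w`, `b`, and `h` are illustrative, and the probabilistic guarantee mentioned above is not modeled here.

```python
def dot(u, v):
    """Inner product of two equal-length vectors given as lists of floats."""
    return sum(a * c for a, c in zip(u, v))


def steer(h, w, b):
    """Return the point closest to h that satisfies dot(w, x) <= b.

    Assumption: the undesired region is the half-space dot(w, x) > b.
    The correction is computed in closed form, with no gradient steps.
    """
    margin = dot(w, h) - b
    if margin <= 0.0:
        # Already in the allowed region: leave the activation unchanged.
        return list(h)
    # Move along w just far enough to land on the boundary dot(w, x) = b.
    scale = margin / dot(w, w)
    return [hi - scale * wi for hi, wi in zip(h, w)]


# Hypothetical concept direction and threshold (not taken from the paper).
w = [1.0, 0.0, 0.0]
b = 0.5

h_safe = [0.2, 1.0, -1.0]   # dot(w, h_safe) = 0.2, already allowed
h_bad = [2.0, 1.0, -1.0]    # dot(w, h_bad) = 2.0, inside the undesired region

steered_safe = steer(h_safe, w, b)
steered_bad = steer(h_bad, w, b)
```

In this toy setting the intervention acts only when the activation crosses the boundary, and it changes only the component along `w`, which is one simple way to limit the effect on the rest of the representation.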