The increasing capabilities of large generative models and their ever more widespread deployment have raised concerns about their reliability, safety, and potential misuse. To address these issues, recent works have proposed controlling model generation by steering model activations, effectively inducing or suppressing the emergence of concepts or behaviors in the generated output. In this paper, we introduce Activation Transport (AcT), a general framework for steering activations guided by optimal transport theory that generalizes many previous activation-steering works. AcT is modality-agnostic and provides fine-grained control over model behavior with negligible computational overhead, while minimally impacting model abilities. We experimentally show the effectiveness and versatility of our approach by addressing key challenges in large language models (LLMs) and text-to-image diffusion models (T2Is). For LLMs, we show that AcT can effectively mitigate toxicity, induce arbitrary concepts, and increase their truthfulness. For T2Is, we show how AcT enables fine-grained style control and concept negation.
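To make the core idea concrete, the following is a minimal sketch (not the paper's actual implementation) of optimal-transport-guided activation steering under a simplifying assumption: each activation coordinate is treated as 1D Gaussian, for which the optimal transport map between a source distribution (e.g. activations on neutral prompts) and a target distribution (e.g. activations exhibiting the desired concept) has the closed affine form `x → a·x + b`. The function names and the strength parameter `lam` are illustrative, not from the paper.

```python
import numpy as np

def fit_affine_transport(src, tgt):
    """Fit the closed-form 1D Gaussian OT map x -> a*x + b that
    transports the source activation distribution onto the target one.
    (Illustrative sketch; assumes per-coordinate Gaussian activations.)"""
    a = tgt.std() / (src.std() + 1e-8)   # match standard deviations
    b = tgt.mean() - a * src.mean()      # match means
    return a, b

def steer(x, a, b, lam=1.0):
    """Interpolate between the identity (lam=0) and the full transport
    map (lam=1), giving fine-grained control over steering strength."""
    return (1 - lam) * x + lam * (a * x + b)
```

In practice, such a map would be fitted per neuron (or per layer) from activations collected on paired prompt sets, then applied at inference time; the interpolation parameter controls how strongly the concept is induced, and setting it to zero recovers the unmodified model.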