The increasing capabilities of large generative models and their ever more widespread deployment have raised concerns about their reliability, safety, and potential misuse. To address these issues, recent works have proposed controlling model generation by steering model activations to induce or suppress the emergence of concepts or behaviors in the generated output. In this paper, we introduce Activation Transport (AcT), a general framework, grounded in optimal transport theory, that steers activations and generalizes many previous activation-steering works. AcT is modality-agnostic and provides fine-grained control over model behavior with negligible computational overhead, while minimally impacting model abilities. We experimentally demonstrate the effectiveness and versatility of our approach on key challenges in large language models (LLMs) and text-to-image diffusion models (T2Is). For LLMs, we show that AcT can effectively mitigate toxicity, induce arbitrary concepts, and increase truthfulness. For T2Is, we show how AcT enables fine-grained style control and concept negation.
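To make the core idea concrete, the sketch below illustrates one simple instance of transport-guided activation steering: a per-dimension affine map, which is the closed-form optimal transport map between two 1-D Gaussians. All names, the toy data, and the interpolation parameter `strength` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for hidden activations collected from a model:
# "src" = activations on neutral inputs, "tgt" = on inputs exhibiting
# the desired concept. (Illustrative data, not from the paper.)
src = rng.normal(loc=0.0, scale=1.0, size=(4096, 8))  # (samples, dims)
tgt = rng.normal(loc=2.0, scale=0.5, size=(4096, 8))

# Per-dimension statistics of the two activation distributions.
mu_s, sd_s = src.mean(0), src.std(0)
mu_t, sd_t = tgt.mean(0), tgt.std(0)

def transport(x, strength=1.0):
    """Move activations toward the target distribution.

    Applies the 1-D Gaussian optimal transport map
    x -> mu_t + (sd_t / sd_s) * (x - mu_s) per dimension.
    strength=0 leaves activations unchanged; strength=1 applies the
    full map, giving fine-grained control over the induced behavior.
    """
    mapped = mu_t + (sd_t / sd_s) * (x - mu_s)
    return (1.0 - strength) * x + strength * mapped

# Fully transported activations match the target statistics.
moved = transport(src, strength=1.0)
print(moved.mean(0), moved.std(0))
```

In practice, such a map would be estimated from paired activation samples at chosen layers and applied during generation; the `strength` interpolation is one way to expose the fine-grained control the abstract describes.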