LinEAS: End-to-end Learning of Activation Steering with a Distributional Loss

The growing use of generative models in daily life calls for efficient mechanisms to control their generation, for example to produce safe content or to provide users with tools to explore style changes. Ideally, such mechanisms should require only a low volume of unpaired data (i.e., without explicit preferences), and should be cheap at both training and inference time while preserving output quality. Recent research has shown that such mechanisms can be obtained by intervening exclusively on model activations, with the goal of correcting distributional differences between the activations seen when using prompts from a source set vs. a target set (e.g., toxic and non-toxic sentences). While cheap, these fast methods are inherently crude: their maps are tuned locally, without accounting for their impact on downstream layers, which results in interventions that cause unintended shifts when used out-of-sample. In this work, we propose linear end-to-end activation steering (LinEAS), an approach trained with a global loss that accounts simultaneously for all layer-wise distributional shifts. In addition to being more robust, the loss used to train LinEAS can be regularized with sparsifying norms, which can automatically carry out neuron selection. LinEAS requires only a handful of unpaired samples to be effective, and beats similar baselines on toxicity mitigation in language models, becoming competitive with oracle-dependent methods that have access to strong supervision. LinEAS is modality-agnostic, and we empirically find that it outperforms existing activation steering methods at both mitigating concepts and including new ones in the output of single-step text-to-image generation models.
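
To make the contrast between local and end-to-end tuning concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy tanh MLP, per-neuron affine steering maps, a per-neuron 1D Wasserstein distance computed by sorting equal-sized samples, and an L1 penalty that pulls each map toward the identity. All of these choices, and every function name, are illustrative assumptions that the abstract does not specify. The point it shows is that each steered layer feeds the next one, so the summed loss sees the downstream effect of every map.

```python
import numpy as np

def forward_with_steering(x, weights, params):
    """Toy MLP (assumed): apply a per-neuron affine map after every layer.

    The steered output of one layer is the input of the next, so later
    activations depend on all earlier maps. Returns steered activations
    for every layer.
    """
    acts, h = [], x
    for W, (scale, shift) in zip(weights, params):
        h = np.tanh(h @ W)
        h = h * scale + shift  # hypothetical linear steering map
        acts.append(h)
    return acts

def wasserstein_1d(a, b):
    """Per-neuron squared 2-Wasserstein distance between equal-sized samples."""
    return np.mean((np.sort(a, axis=0) - np.sort(b, axis=0)) ** 2, axis=0)

def global_loss(x_src, x_tgt, weights, params, l1_weight=0.0):
    """Sum of layer-wise distributional gaps plus a sparsifying L1 penalty."""
    identity = [(np.ones(W.shape[1]), np.zeros(W.shape[1])) for W in weights]
    steered = forward_with_steering(x_src, weights, params)
    target = forward_with_steering(x_tgt, weights, identity)  # unsteered target run
    gap = sum(wasserstein_1d(s, t).sum() for s, t in zip(steered, target))
    # Zero penalty at the identity map; neurons left untouched cost nothing.
    penalty = sum(np.abs(sc - 1.0).sum() + np.abs(sh).sum() for sc, sh in params)
    return gap + l1_weight * penalty

# Usage: unpaired source and target batches of equal size.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 8)), rng.normal(size=(8, 8))]
params = [(np.ones(8), np.zeros(8)) for _ in weights]  # start at the identity
x_src = rng.normal(size=(16, 4))
x_tgt = rng.normal(loc=1.0, size=(16, 4))
print(global_loss(x_src, x_tgt, weights, params, l1_weight=0.1))
```

In practice such a loss would be minimized over the scale and shift parameters with an automatic differentiation framework. The sketch only evaluates it.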
