Large Language Models (LLMs) have demonstrated remarkable capabilities in generating human-like text, but their output may not be aligned with user intent and can even contain harmful content. This paper presents a novel approach to detecting and steering concepts such as toxicity before generation. We introduce the Sparse Conditioned Autoencoder (SCAR), a single trained module that extends an otherwise untouched LLM. SCAR provides full steerability, both towards and away from concepts (e.g., toxic content), without compromising the quality of the model's text generation on standard evaluation benchmarks. We demonstrate the effectiveness of our approach on a variety of concepts, including toxicity, safety, and writing-style alignment. As such, this work establishes a robust framework for controlling LLM generations, supporting their ethical and safe deployment in real-world applications.
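The core idea — a sparse autoencoder attached to a frozen LLM's hidden states, with one latent dimension conditioned on a concept that can be clamped at inference time — can be sketched roughly as follows. All sizes, weight initializations, the choice of a single concept latent (`CONCEPT_IDX`), and the residual pass-through are illustrative assumptions for this sketch, not the paper's trained SCAR module.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; CONCEPT_IDX marks the latent conditioned on the concept.
D_MODEL, D_LATENT, CONCEPT_IDX = 16, 64, 0

# Random weights stand in for a trained module (assumption: a plain
# ReLU sparse-autoencoder layout with separate encoder/decoder biases).
W_enc = rng.standard_normal((D_MODEL, D_LATENT)) * 0.1
W_dec = rng.standard_normal((D_LATENT, D_MODEL)) * 0.1
b_enc = np.zeros(D_LATENT)
b_dec = np.zeros(D_MODEL)

def encode(h):
    # Sparse code via ReLU, as in standard sparse autoencoders.
    return np.maximum(0.0, h @ W_enc + b_enc)

def decode(z):
    return z @ W_dec + b_dec

def steer(h, alpha):
    """Clamp the concept latent to `alpha` and add back the reconstruction
    error, so features the autoencoder misses are left untouched."""
    z = encode(h)
    residual = h - decode(z)             # what the autoencoder fails to reconstruct
    z_steered = z.copy()
    z_steered[..., CONCEPT_IDX] = alpha  # push toward (large alpha) or away (zero)
    return decode(z_steered) + residual

h = rng.standard_normal(D_MODEL)   # stand-in for a hidden state from the frozen LLM
h_suppressed = steer(h, alpha=0.0) # steer away from the concept
h_amplified = steer(h, alpha=5.0)  # steer toward the concept
```

Because the reconstruction error is passed through, clamping the concept latent to its original value recovers the input hidden state exactly; only the one conditioned dimension is moved when steering.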