Large Language Models (LLMs) have demonstrated remarkable capabilities in generating human-like text, but their outputs may not be aligned with user intent and may even contain harmful content. This paper presents a novel approach to detect and steer concepts such as toxicity before generation. We introduce the Sparse Conditioned Autoencoder (SCAR), a single trained module that extends the otherwise untouched LLM. SCAR ensures full steerability, toward and away from concepts (e.g., toxic content), without compromising the quality of the model's text generation on standard evaluation benchmarks. We demonstrate the effective application of our approach to a variety of concepts, including toxicity, safety, and writing style alignment. As such, this work establishes a robust framework for controlling LLM generations, ensuring their ethical and safe deployment in real-world applications.
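To make the idea concrete, the following is a minimal sketch (not the authors' implementation) of how a sparse autoencoder module could sit on top of a frozen LLM's hidden states, reserving one latent dimension for the conditioned concept so that it can be read out for detection and rescaled for steering. All names, dimensions, and the ReLU-sparsity choice are illustrative assumptions.

```python
# Illustrative sketch: a sparse autoencoder over a transformer hidden state with
# one latent reserved as the "concept" feature (e.g., toxicity). Reading that
# latent gives a detection score; rescaling it before decoding steers the
# reconstructed activation toward or away from the concept.
import torch
import torch.nn as nn


class SparseConditionedAutoencoder(nn.Module):
    def __init__(self, hidden_dim: int = 4096, latent_dim: int = 16384, concept_idx: int = 0):
        super().__init__()
        self.encoder = nn.Linear(hidden_dim, latent_dim)
        self.decoder = nn.Linear(latent_dim, hidden_dim)
        self.concept_idx = concept_idx  # latent dimension tied to the conditioned concept

    def forward(self, h: torch.Tensor, steer: float = 1.0):
        z = torch.relu(self.encoder(h))            # sparse latent code
        concept_score = z[..., self.concept_idx]   # detection: concept activation strength
        z = z.clone()
        z[..., self.concept_idx] *= steer          # steering: amplify (>1) or suppress (<1)
        return self.decoder(z), concept_score


# Schematic usage: the module would be attached to one layer's residual stream
# via a forward hook, leaving the LLM's own weights untouched.
scar = SparseConditionedAutoencoder()
hidden = torch.randn(1, 4096)                                 # stand-in for a hidden state
steered_hidden, toxicity_score = scar(hidden, steer=0.0)      # steer away from the concept
```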