Large Language Models (LLMs) are used for a wide variety of tasks. While they can generate human-like responses, they can also produce undesirable output, including potentially harmful information, racist or sexist language, and hallucinations. Alignment methods aim to reduce such output via techniques such as fine-tuning, prompt engineering, and representation engineering. However, existing methods face several challenges: some require costly fine-tuning for every alignment task; some do not adequately remove undesirable concepts, so alignment fails; and some also remove benign concepts, degrading the linguistic capabilities of LLMs. To address these issues, we propose Parsimonious Concept Engineering (PaCE), a novel activation engineering framework for alignment. First, to model the concepts sufficiently, we construct a large-scale concept dictionary in the activation space, in which each atom corresponds to a semantic concept. Given any alignment task, we instruct a concept partitioner to efficiently annotate the concepts as benign or undesirable. Then, at inference time, we decompose the LLM activations along the concept dictionary via sparse coding, to accurately represent the activations as linear combinations of benign and undesirable components. By removing the latter from the activations, we reorient the behavior of the LLM towards the alignment goal. We conduct experiments on tasks such as response detoxification, faithfulness enhancement, and sentiment revising, and show that PaCE achieves state-of-the-art alignment performance while maintaining linguistic capabilities.
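The decompose-and-remove step described above can be illustrated with a minimal sketch. The abstract says only that activations are decomposed along the concept dictionary via sparse coding; the specific solver used here (a greedy orthogonal matching pursuit) is an assumption chosen for brevity, and the names `omp`, `remove_undesirable`, `D`, `undesirable`, and `k` are hypothetical rather than taken from the PaCE implementation.

```python
import numpy as np


def omp(D, x, k):
    """Greedy orthogonal matching pursuit (an assumed stand-in solver).

    D: (d, n) dictionary whose columns are unit-norm concept atoms.
    x: (d,) activation vector.
    k: maximum number of atoms allowed in the sparse code.
    Returns an (n,) coefficient vector with at most k non-zeros.
    """
    coef = np.zeros(D.shape[1])
    residual = x.astype(float).copy()
    support = []
    for _ in range(k):
        scores = np.abs(D.T @ residual)
        if support:
            scores[support] = 0.0  # never select the same atom twice
        j = int(np.argmax(scores))
        if scores[j] < 1e-12:  # nothing left to explain
            break
        support.append(j)
        # Re-fit all selected atoms jointly by least squares.
        sol, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ sol
        coef[:] = 0.0
        coef[support] = sol
    return coef


def remove_undesirable(D, x, undesirable, k):
    """Subtract the components of x carried by atoms flagged as undesirable.

    undesirable: (n,) boolean mask, hypothetically produced by the
    concept partitioner for the alignment task at hand.
    """
    coef = omp(D, x, k)
    return x - D[:, undesirable] @ coef[undesirable]


# Toy example: four orthonormal "concept" atoms, atom 2 flagged undesirable.
D = np.eye(4)
x = np.array([2.0, 0.0, 3.0, 0.0])
undesirable = np.array([False, False, True, False])
cleaned = remove_undesirable(D, x, undesirable, k=2)
```

In this sketch only the undesirable components are subtracted, so the benign components and any residual not explained by the sparse code are left in the activation. In a real setting `D` would be heavily overcomplete and `x` would be an activation read from a chosen layer at inference time; both are placeholders here.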