PaCE: Parsimonious Concept Engineering for Large Language Models

Large Language Models (LLMs) are being used for a wide variety of tasks. While they are capable of generating human-like responses, they can also produce undesirable output including potentially harmful information, racist or sexist language, and hallucinations. Alignment methods are designed to reduce such undesirable outputs via techniques such as fine-tuning, prompt engineering, and representation engineering. However, existing methods face several challenges: some require costly fine-tuning for every alignment task; some do not adequately remove undesirable concepts, failing alignment; some remove benign concepts, lowering the linguistic capabilities of LLMs. To address these issues, we propose Parsimonious Concept Engineering (PaCE), a novel activation engineering framework for alignment. First, to sufficiently model the concepts, we construct a large-scale concept dictionary in the activation space, in which each atom corresponds to a semantic concept. Given any alignment task, we instruct a concept partitioner to efficiently annotate the concepts as benign or undesirable. Then, at inference time, we decompose the LLM activations along the concept dictionary via sparse coding, to accurately represent the activations as linear combinations of benign and undesirable components. By removing the latter ones from the activations, we reorient the behavior of the LLM towards the alignment goal. We conduct experiments on tasks such as response detoxification, faithfulness enhancement, and sentiment revising, and show that PaCE achieves state-of-the-art alignment performance while maintaining linguistic capabilities.
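The decompose-and-remove step described in the abstract can be illustrated with a minimal NumPy sketch. Everything concrete here is an assumption made for illustration and not taken from the paper: the random dictionary `D`, the boolean mask `undesirable`, the helper names `omp` and `remove_undesirable`, and the choice of orthogonal matching pursuit as the sparse coding solver (the abstract says only "sparse coding").

```python
import numpy as np


def omp(D, x, k):
    """Greedy sparse coding via orthogonal matching pursuit (illustrative choice).

    D: (d, n) dictionary whose unit-norm columns are concept atoms.
    x: (d,) activation vector.
    k: maximum number of atoms to select.
    Returns c of shape (n,) with at most k nonzeros so that D @ c approximates x.
    """
    x = np.asarray(x, dtype=float)
    c = np.zeros(D.shape[1])
    support = []
    residual = x.copy()
    for _ in range(k):
        if np.linalg.norm(residual) < 1e-10:
            break  # x is already fully explained
        # Select the atom most correlated with the current residual.
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j in support:
            break  # no new atom improves the fit
        support.append(j)
        # Refit all selected coefficients jointly by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    if support:
        c[support] = coef
    return c


def remove_undesirable(D, x, undesirable, k):
    """Subtract the part of x that the sparse code assigns to undesirable atoms."""
    c = omp(D, x, k)
    undesirable_part = D[:, undesirable] @ c[undesirable]
    return x - undesirable_part, c


# Hypothetical toy setup: 40 random unit-norm atoms in a 16-dim activation space.
rng = np.random.default_rng(0)
d, n = 16, 40
D = rng.standard_normal((d, n))
D /= np.linalg.norm(D, axis=0)

# Assumed output of a concept partitioner: atoms 5 and 17 are marked undesirable.
undesirable = np.zeros(n, dtype=bool)
undesirable[[5, 17]] = True

# An activation mixing one benign atom (2) and one undesirable atom (17).
x = 1.5 * D[:, 2] + 2.0 * D[:, 17]
x_clean, c = remove_undesirable(D, x, undesirable, k=4)
print("atoms used:", np.flatnonzero(c))
print("norm of removed component:", np.linalg.norm(x - x_clean))
```

In this sketch the benign component and any unexplained residual stay in `x_clean`; only the contribution attributed to atoms flagged as undesirable is subtracted. A real system would take `D` from model activations and apply the edit inside the forward pass, neither of which is shown here.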

