Tradeoffs Between Alignment and Helpfulness in Language Models with Representation Engineering

Language model alignment has become an important component of AI safety, allowing safe interactions between humans and language models by enhancing desired behaviors and inhibiting undesired ones. It is often done by tuning the model or inserting preset aligning prompts. Recently, representation engineering, a method that alters the model's behavior by changing its representations post-training, was shown to be effective in aligning LLMs (Zou et al., 2023a). Representation engineering yields gains in alignment-oriented tasks such as resistance to adversarial attacks and reduction of social biases, but was also shown to decrease the model's ability to perform basic tasks. In this paper we study the tradeoff between the increase in alignment and the decrease in helpfulness of the model. We propose a theoretical framework that provides bounds for these two quantities, and demonstrate their relevance empirically. First, we find that under the conditions of our framework, alignment can be guaranteed with representation engineering, but helpfulness is harmed in the process. Second, we show that helpfulness is harmed quadratically with the norm of the representation engineering vector, while alignment increases linearly with it, indicating a regime in which it is efficient to use representation engineering. We validate our findings empirically, and chart the boundaries of the usefulness of representation engineering for alignment.
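As a rough illustration of the linear-versus-quadratic claim, the following toy sketch treats representation engineering as adding a scaled unit direction to a hidden state, so that the scale `alpha` plays the role of the vector's norm. The function names, the constants `a` and `b`, and the exact functional forms are assumptions made for illustration only; they are not the bounds or the method from the paper.

```python
# Toy illustration only. The constants a and b and the exact linear and
# quadratic forms are hypothetical, chosen to mirror the abstract's claim.

def steer(hidden, direction, alpha):
    """Shift a hidden-state vector by alpha * direction (element-wise)."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

def alignment_gain(alpha, a=1.0):
    """Assumed alignment gain, linear in the steering norm alpha."""
    return a * alpha

def helpfulness_loss(alpha, b=0.5):
    """Assumed helpfulness loss, quadratic in the steering norm alpha."""
    return b * alpha ** 2

def net_benefit(alpha, a=1.0, b=0.5):
    """Gain minus loss under the two assumed forms above."""
    return alignment_gain(alpha, a) - helpfulness_loss(alpha, b)

def best_alpha(a=1.0, b=0.5):
    """Maximizer of a*alpha - b*alpha**2, which is alpha = a / (2*b)."""
    return a / (2 * b)
```

Under these assumed forms, small values of `alpha` give a gain that outweighs the loss, while large values let the quadratic term dominate, which is one way to read the "efficient regime" mentioned in the abstract.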

