Language model alignment has become an important component of AI safety: by enhancing desired behaviors and inhibiting undesired ones, it allows safe interactions between humans and language models. It is often done by fine-tuning the model or by inserting preset aligning prompts. Recently, representation engineering, a method that alters a model's behavior by changing its representations post-training, was shown to be effective in aligning LLMs (Zou et al., 2023a). Representation engineering yields gains in alignment-oriented tasks such as resistance to adversarial attacks and reduction of social biases, but it was also shown to decrease the model's ability to perform basic tasks. In this paper we study the tradeoff between the increase in alignment and the decrease in helpfulness of the model. We propose a theoretical framework that provides bounds for these two quantities, and demonstrate their relevance empirically. Interestingly, we find that although helpfulness generally decreases, it does so quadratically with the norm of the representation engineering vector, whereas alignment increases linearly with it. This indicates a regime in which representation engineering is efficient: for sufficiently small norms, the linear gain in alignment dominates the quadratic loss in helpfulness. We validate our findings empirically and chart the limits of the usefulness of representation engineering for alignment.
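To make the linear-versus-quadratic argument concrete, the following is a minimal toy sketch and not the framework proposed in the paper. It assumes alignment gain of the form `a * r` and helpfulness loss of the form `b * r**2`, where `r` is the norm of the representation engineering vector. The coefficients `a` and `b`, the function names, and the idea of combining the two quantities into a single "net benefit" by subtraction are all illustrative assumptions.

```python
# Toy illustration only: the functional forms follow the abstract's
# qualitative claim (linear gain, quadratic loss); the coefficients a, b
# and the subtraction-based "net benefit" are hypothetical.

def alignment_gain(r: float, a: float = 1.0) -> float:
    """Assumed linear model: alignment improves proportionally to the norm r."""
    return a * r


def helpfulness_loss(r: float, b: float = 0.5) -> float:
    """Assumed quadratic model: helpfulness degrades with the square of r."""
    return b * r ** 2


def net_benefit(r: float, a: float = 1.0, b: float = 0.5) -> float:
    """Hypothetical combined score: gain minus loss, a*r - b*r^2."""
    return alignment_gain(r, a) - helpfulness_loss(r, b)


def best_norm(a: float = 1.0, b: float = 0.5) -> float:
    """Norm maximizing a*r - b*r^2; setting the derivative a - 2*b*r to zero gives a / (2*b)."""
    return a / (2 * b)


def break_even_norm(a: float = 1.0, b: float = 0.5) -> float:
    """Norm beyond which the quadratic loss exceeds the linear gain: r = a / b."""
    return a / b


if __name__ == "__main__":
    for r in (0.5, 1.0, 2.0, 3.0):
        print(f"r={r:.1f}  gain={alignment_gain(r):.2f}  "
              f"loss={helpfulness_loss(r):.2f}  net={net_benefit(r):.2f}")
    print("best norm:", best_norm(), " break-even norm:", break_even_norm())
```

Under these assumed coefficients, the net score is positive for norms below `a / b` and peaks at `a / (2 * b)`, which is one simple way to picture the "efficient regime" described above. The actual bounds and constants depend on the paper's analysis.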