Value alignment of Large Language Models (LLMs) requires us to measure empirically the representations of value that these models have actually acquired. One characteristic of human value representation is that it distinguishes among different kinds of value. We investigate whether LLMs likewise distinguish three kinds of good: moral, grammatical, and economic. By probing model behavior, embeddings, and residual stream activations, we report pervasive cases of value entanglement: a conflation of these distinct representations of value. Specifically, both grammatical and economic valuations were found to be overly influenced by moral value, relative to human norms. This conflation was repaired by selectively ablating the activation vectors associated with morality.
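To make the ablation step concrete, the following is a minimal sketch of one common form of directional ablation: projecting a single direction out of a set of activations. It is an illustration under stated assumptions, not the procedure used in this work. The abstract does not say how the morality-associated vectors were identified, so the difference-of-means estimate, the function names, and the synthetic data are all hypothetical.

```python
import numpy as np


def difference_of_means_direction(pos, neg):
    """Unit vector pointing from the mean of `neg` to the mean of `pos`.

    ASSUMPTION: a difference of class means is only one possible way to
    estimate a concept direction; the source does not specify a method.
    """
    diff = pos.mean(axis=0) - neg.mean(axis=0)
    norm = np.linalg.norm(diff)
    if norm == 0:
        raise ValueError("class means coincide; direction is undefined")
    return diff / norm


def ablate_direction(acts, direction):
    """Remove the component of each activation row along `direction`."""
    unit = direction / np.linalg.norm(direction)
    # acts @ unit gives each row's scalar projection onto the direction;
    # the outer product rebuilds that component so it can be subtracted.
    return acts - np.outer(acts @ unit, unit)


# Synthetic stand-ins for residual stream activations (hypothetical data).
rng = np.random.default_rng(0)
d_model = 16
neutral = rng.normal(size=(32, d_model))
moral = rng.normal(size=(32, d_model))
moral[:, 0] += 3.0  # toy "moral" signal injected along one axis

direction = difference_of_means_direction(moral, neutral)
ablated = ablate_direction(moral, direction)
```

After ablation, every row of `ablated` has zero projection onto `direction`, while components orthogonal to it are unchanged. In a real model the same projection would be applied to residual stream activations during the forward pass, which this sketch does not attempt.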