Language models (LMs) are known to represent the perspectives of some social groups better than others, which may impact their performance, especially on subjective tasks such as content moderation and hate speech detection. To explore how LMs represent different perspectives, existing research has focused on positional alignment, i.e., how closely the models mimic the opinions and stances of different groups, e.g., liberals or conservatives. However, human communication also encompasses emotional and moral dimensions. We define the problem of affective alignment, which measures how closely LMs' emotional and moral tone matches that of different groups. By comparing the affect of responses generated by 36 LMs to the affect of Twitter messages, we observe significant misalignment of LMs with both ideological groups. This misalignment is larger than the partisan divide in the U.S. Even after steering the LMs towards specific ideological perspectives, the misalignment and liberal tendencies of the models persist, suggesting a systemic bias within LMs.