Language models (LMs) are known to represent the perspectives of some social groups better than others, which may affect their performance, especially on subjective tasks such as content moderation and hate speech detection. To explore how LMs represent different perspectives, existing research has focused on positional alignment, i.e., how closely the models mimic the opinions and stances of different groups, e.g., liberals or conservatives. However, human communication also encompasses emotional and moral dimensions. We define the problem of affective alignment, which measures how well the emotional and moral tone of LMs represents that of different groups. By comparing the affect of responses generated by 36 LMs to the affect of Twitter messages, we observe significant misalignment of LMs with both ideological groups. This misalignment is larger than the partisan divide in the U.S. Even after steering the LMs towards specific ideological perspectives, the misalignment and liberal tendencies of the models persist, suggesting a systemic bias within LMs.