Do LLMs align with human perceptions of safety? We study this question via annotation alignment, the extent to which LLMs and humans agree when annotating the safety of user-chatbot conversations. We leverage the recent DICES dataset (Aroyo et al., 2023), in which 350 conversations are each rated for safety by 112 annotators spanning 10 race-gender groups. GPT-4 achieves a Pearson correlation of $r = 0.59$ with the average annotator rating, \textit{higher} than the median annotator's correlation with the average ($r = 0.51$). We show that larger datasets are needed to resolve whether LLMs exhibit disparities in how well they correlate with different demographic groups. We also find substantial idiosyncratic variation in correlation within groups, suggesting that race and gender do not fully capture differences in alignment. Finally, we find that GPT-4 cannot predict when one demographic group finds a conversation more unsafe than another.
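The alignment metric above can be sketched in a few lines of NumPy. The snippet below is a minimal illustration, not the authors' code: the ratings and model scores are synthetic stand-ins with the DICES shape (350 conversations, 112 annotators), and the leave-one-out comparison of each annotator against the average of the others is one plausible reading of "the median annotator's correlation with the average".

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data (the real values would come from DICES):
# 350 conversations, 112 annotators, binary safety labels (1 = unsafe).
n_conv, n_annot = 350, 112
ratings = rng.integers(0, 2, size=(n_conv, n_annot)).astype(float)
# Toy "LLM" safety scores, loosely tracking the annotator consensus.
model_scores = ratings.mean(axis=1) + rng.normal(0.0, 0.2, size=n_conv)

def pearson(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    a, b = a - a.mean(), b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

# Model alignment: correlation with the average annotator rating.
mean_rating = ratings.mean(axis=1)
model_r = pearson(model_scores, mean_rating)

# Each annotator's correlation with the average of the *other* annotators
# (leave-one-out, so an annotator's own label does not inflate the score).
annot_rs = [
    pearson(ratings[:, j], np.delete(ratings, j, axis=1).mean(axis=1))
    for j in range(n_annot)
]
median_annot_r = float(np.median(annot_rs))
```

Comparing `model_r` against `median_annot_r` mirrors the abstract's headline comparison ($r = 0.59$ vs. $r = 0.51$ on the real data); per-group disparities would be obtained by restricting the annotator columns to each race-gender group before averaging.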