Which Demographics do LLMs Default to During Annotation?

Johannes Schäfer, Aidan Combs, Christopher Bagdon, Jiahui Li, Nadine Probol, Lynn Greschner, Sean Papay, Yarik Menchaca Resendiz, Aswathy Velutharambath, Amelie Wührl, Sabine Weber, Roman Klinger

The demographics and cultural background of annotators influence the labels they assign in text annotation. For instance, an elderly woman might find it offensive to read a message addressed to a "bro", while a male teenager might find it appropriate. It is therefore important to acknowledge label variation so that members of a society are not under-represented. Two research directions have developed out of this observation in the context of using large language models (LLMs) for data annotation: (1) studying the biases and inherent knowledge of LLMs and (2) injecting diversity into the output by adding demographic information to the prompt. We combine these two strands of research and ask which demographics an LLM resorts to when none are given. To answer this question, we evaluate which attributes of human annotators LLMs inherently mimic. Furthermore, we compare prompts without demographic conditioning and placebo-conditioned prompts (e.g., "You are an annotator who lives in house number 5") to demographics-conditioned prompts (e.g., "You are a 45 year old man and an expert on politeness annotation. How do you rate {instance}"). We study these questions for politeness and offensiveness annotations on the POPQUORN data set, a corpus created in a controlled manner to investigate human label variation based on demographics, which has not been used for LLM-based analyses so far. We observe notable influences related to gender, race, and age in demographic prompting, which contrasts with previous studies that found no such effects.
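The three prompt conditions compared in the abstract can be illustrated with a minimal Python sketch. Only the quoted persona sentences and the question "How do you rate {instance}" come from the abstract; the function names, the dictionary keys, and the way the persona is prepended to the question are assumptions made for illustration and may differ from the prompts actually used in the study.

```python
# Hypothetical helpers: names and prompt layout are illustrative assumptions,
# not the authors' implementation.

# Placebo persona quoted in the abstract (carries no demographic information).
PLACEBO_PERSONA = "You are an annotator who lives in house number 5."


def demographic_persona(age: int, gender: str, task: str) -> str:
    """Build a persona sentence following the example quoted in the abstract."""
    return f"You are a {age} year old {gender} and an expert on {task} annotation."


def build_prompt(instance: str, persona: str = "") -> str:
    """Prepend an optional persona sentence to the rating question."""
    question = f"How do you rate {instance}"
    return f"{persona} {question}".strip()


def prompt_conditions(instance: str, task: str = "politeness") -> dict:
    """Return one prompt per condition: none, placebo, and demographic."""
    return {
        "none": build_prompt(instance),
        "placebo": build_prompt(instance, PLACEBO_PERSONA),
        "demographic": build_prompt(instance, demographic_persona(45, "man", task)),
    }


if __name__ == "__main__":
    for name, prompt in prompt_conditions("Thanks, bro!").items():
        print(f"[{name}] {prompt}")
```

Comparing model ratings across the three dictionary entries for the same instance is one way to check whether a persona changes the label at all (placebo versus none) and whether the change depends on the demographic content (demographic versus placebo).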
