Intersectionality in Conversational AI Safety: How Bayesian Multilevel Models Help Understand Diverse Perceptions of Safety

Conversational AI systems exhibit a level of human-like behavior that promises to have profound impacts on many aspects of daily life, including how people access information, create content, and seek social support. Yet these models have also shown a propensity for bias, offensive language, and false information. Consequently, understanding and moderating the safety risks of these models is a critical technical and social challenge. Perception of safety is intrinsically subjective: many factors, often intersecting, can determine why one person considers a conversation with a chatbot safe while another considers the same conversation unsafe. In this work, we focus on demographic factors that could influence such diverse perceptions. To this end, we contribute an analysis using Bayesian multilevel modeling to explore the connection between rater demographics and how raters report the safety of conversational AI systems. We study a sample of 252 human raters stratified by gender, age group, race/ethnicity group, and locale. This rater pool provided safety labels for 1,340 human-chatbot conversations. Our results show that intersectional effects involving demographic characteristics such as race/ethnicity, gender, and age, as well as content characteristics such as degree of harm, all play significant roles in how raters judge the safety of conversational AI systems. For example, race/ethnicity and gender show strong intersectional effects, particularly among South Asian and East Asian women. We also find that the degree of harm in a conversation affects the ratings of raters in all race/ethnicity groups, but that Indigenous and South Asian raters are particularly sensitive to this harm. Finally, we observe that the effect of education is uniquely intersectional for Indigenous raters, highlighting the utility of multilevel frameworks for uncovering underrepresented social perspectives.
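
To make the modeling idea concrete, a multilevel model of this kind could be written roughly as follows. This is an illustrative sketch only, not the specification used in this work: the binary outcome, the logit link, the choice of grouping terms, and every symbol below are assumptions introduced for the example.

```latex
% Illustrative sketch under stated assumptions; not the authors' model.
% y_{ij} = 1 if rater i labels conversation j as unsafe (assumed binary coding).
% r[i], g[i]: race/ethnicity group and gender of rater i (hypothetical index notation).
% h_j: degree of harm of conversation j (assumed numeric coding).
\begin{align*}
y_{ij} &\sim \mathrm{Bernoulli}(p_{ij}) \\
\operatorname{logit}(p_{ij}) &= \alpha + a_{r[i]} + b_{g[i]} + c_{r[i],\,g[i]}
    + \bigl(\beta + d_{r[i]}\bigr)\, h_j + u_i \\
a_r &\sim \mathcal{N}(0, \sigma_a^2), \quad
b_g \sim \mathcal{N}(0, \sigma_b^2), \quad
c_{r,g} \sim \mathcal{N}(0, \sigma_c^2) \\
d_r &\sim \mathcal{N}(0, \sigma_d^2), \quad
u_i \sim \mathcal{N}(0, \sigma_u^2)
\end{align*}
```

In this sketch, the term c_{r,g} would carry an intersectional effect of race/ethnicity and gender beyond their separate contributions, d_r would let sensitivity to degree of harm vary by race/ethnicity group, and u_i would absorb rater-level variation. Because the group-level terms share common variance parameters, estimates for small demographic groups are partially pooled toward the overall mean, which is one reason a multilevel framework suits sparsely represented intersections.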
