Recent advances in Large Language Models enable them to follow freeform instructions, including imitating generic or specific demographic personas in conversations. We define generic personas as those that represent demographic groups, such as "an Asian person", whereas specific personas may take the form of specific names, such as the popular Asian name "Yumi". While adopting personas enriches user experiences by making dialogue systems more engaging and approachable, it also risks exacerbating social biases in model responses, thereby causing societal harm through interactions with users. In this paper, we systematically study "persona biases", which we define as the sensitivity of dialogue models' harmful behaviors to the personas they adopt. We categorize persona biases into biases in harmful expression and biases in harmful agreement, and establish a comprehensive evaluation framework that measures persona biases in five aspects: Offensiveness, Toxic Continuation, Regard, Stereotype Agreement, and Toxic Agreement. Additionally, we propose to investigate persona biases by experimenting with UNIVERSALPERSONA, a systematically constructed persona dataset encompassing various types of both generic and specific model personas. By benchmarking four models (Blender, ChatGPT, Alpaca, and Vicuna), our study uncovers significant persona biases in dialogue systems. Our findings also underscore the pressing need to revisit the use of personas in dialogue agents to ensure safe application.