Recent advancements in Large Language Models empower them to follow freeform instructions, including imitating generic or specific demographic personas in conversations. Generic personas refer to an individual from a demographic group (e.g., an Asian person), whereas specific personas can be the actual names of historical figures. While the adoption of personas allows dialogue systems to be more engaging and approachable to users, it also carries the risk of exacerbating social biases in model responses, thereby causing societal harm through interactions with users. In this paper, we systematically study "persona biases", which we define as the sensitivity of harmful dialogue model behaviors to different persona adoptions. We categorize persona biases into biases in harmful expression and biases in harmful agreement, and we establish a comprehensive evaluation framework to measure persona biases in five aspects: Offensiveness, Toxic Continuation, Regard, Stereotype Agreement, and Toxic Agreement. Additionally, we propose to investigate persona biases by experimenting with UniversalPersona, a systematized persona dataset with a comprehensive list of both generic and specific model personas. Through benchmarking on four different models, Blender, ChatGPT, Alpaca, and Vicuna, our study uncovers significant persona biases in these dialogue systems. The findings of our study underscore the immediate need to revisit the use of persona traits in dialogue agents to ensure their safe application.