Recent advancements in Large Language Models empower them to follow freeform instructions, including imitating generic or specific demographic personas in conversations. Generic personas refer to an individual from a demographic group (e.g., an Asian person), whereas specific personas can be actual names of historical figures. While the adoption of personas allows dialogue systems to be more engaging and approachable to users, it also risks exacerbating social biases in model responses, which in turn can cause societal harm through interactions with users. In this paper, we systematically study "persona biases", which we define as the sensitivity of harmful dialogue model behaviors to different persona adoptions. We categorize persona biases into biases in harmful expression and biases in harmful agreement, and we establish a comprehensive evaluation framework to measure persona biases in five aspects: Offensiveness, Toxic Continuation, Regard, Stereotype Agreement, and Toxic Agreement. Additionally, we investigate persona biases by experimenting with UniversalPersona, a systematized persona dataset with a comprehensive list of both generic and specific model personas. Through benchmarking on four different models (Blender, ChatGPT, Alpaca, and Vicuna), our study uncovers significant persona biases in these dialogue systems. Our findings underscore the immediate need to revisit the use of persona traits in dialogue agents to ensure their safe application.