Toxicity in ChatGPT: Analyzing Persona-assigned Language Models

Large language models (LLMs) have shown incredible capabilities and transcended the natural language processing (NLP) community, with adoption across many services such as healthcare, therapy, education, and customer service. Since users include people with critical information needs, like students or patients engaging with chatbots, the safety of these systems is of prime importance. Therefore, a clear understanding of the capabilities and limitations of LLMs is necessary. To this end, we systematically evaluate toxicity in over half a million generations of ChatGPT, a popular dialogue-based LLM. We find that setting the system parameter of ChatGPT by assigning it a persona, say that of the boxer Muhammad Ali, significantly increases the toxicity of generations. Depending on the persona assigned to ChatGPT, its toxicity can increase up to 6x, with outputs engaging in incorrect stereotypes, harmful dialogue, and hurtful opinions. This may be defamatory to the persona and harmful to an unsuspecting user. Furthermore, we find concerning patterns where specific entities (e.g., certain races) are targeted more than others (3x more) irrespective of the assigned persona, which reflects inherent discriminatory biases in the model. We hope that our findings inspire the broader AI community to rethink the efficacy of current safety guardrails and develop better techniques that lead to robust, safe, and trustworthy AI systems.

Related content

ChatGPT

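ChatGPT (full name: Chat Generative Pre-trained Transformer) is a chatbot program developed by the US company OpenAI [1] and released on November 30, 2022. ChatGPT is an AI-driven natural language processing tool that holds conversations by learning and understanding human language. It can respond based on the context of the chat, communicating much as a person would, and can even complete tasks such as writing emails, video scripts, marketing copy, translations, code, and papers. [1] https://openai.com/blog/chatgpt/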