Large language models (LLMs) have shown incredible capabilities and have spread beyond the natural language processing (NLP) community, with adoption across many services such as healthcare, therapy, education, and customer service. Because users include people with critical information needs, such as students or patients engaging with chatbots, the safety of these systems is of prime importance. A clear understanding of the capabilities and limitations of LLMs is therefore necessary. To this end, we systematically evaluate toxicity in over half a million generations of ChatGPT, a popular dialogue-based LLM. We find that setting the system parameter of ChatGPT by assigning it a persona, say that of the boxer Muhammad Ali, significantly increases the toxicity of its generations. Depending on the persona assigned to ChatGPT, its toxicity can increase by up to 6x, with outputs engaging in incorrect stereotypes, harmful dialogue, and hurtful opinions. Such outputs may be defamatory to the persona and harmful to an unsuspecting user. Furthermore, we find concerning patterns in which specific entities (e.g., certain races) are targeted more than others (3x more) irrespective of the assigned persona, reflecting inherent discriminatory biases in the model. We hope that our findings inspire the broader AI community to rethink the efficacy of current safety guardrails and to develop better techniques that lead to robust, safe, and trustworthy AI systems.