This paper makes three contributions. First, it presents a novel, generalizable framework dubbed the \textit{toxicity rabbit hole} that iteratively elicits toxic content from a wide suite of large language models. Spanning 1,266 identity groups, we first conduct a bias audit of the \texttt{PaLM 2} guardrails and present key insights. Next, we report the framework's generalizability across several other models. Through the elicited toxic content, we present a broad analysis with a key emphasis on racism, antisemitism, misogyny, Islamophobia, homophobia, and transphobia. Finally, drawing on concrete examples, we discuss potential ramifications.