This paper conducts a robustness audit of PaLM 2's safety feedback through a novel toxicity rabbit hole framework introduced here. Starting with a stereotype, the framework instructs PaLM 2 to generate content more toxic than that stereotype. In each subsequent iteration, it continues instructing PaLM 2 to generate content more toxic than the previous iteration's output, until PaLM 2's safety guardrails throw a safety violation. Our experiments uncover highly disturbing antisemitic, Islamophobic, racist, homophobic, and misogynistic (to list a few) generated content that PaLM 2's safety guardrails do not evaluate as highly unsafe. We briefly discuss the generalizability of this framework across eight other large language models.
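To make the escalation procedure concrete, the sketch below outlines the rabbit hole loop in Python. It is a minimal illustration under stated assumptions, not the paper's implementation: the callables `query_model` and `is_safety_violation`, the prompt wording, and the iteration cap are hypothetical stand-ins for the PaLM 2 generation call and its safety-feedback signal.

```python
# Minimal sketch of the toxicity rabbit hole loop described in the abstract.
# `query_model` and `is_safety_violation` are hypothetical stand-ins for the
# model's generation API and its safety-feedback signal; the actual prompts
# and safety thresholds used in the paper are not reproduced here.

def toxicity_rabbit_hole(seed_stereotype, query_model, is_safety_violation, max_iters=20):
    """Iteratively ask the model to escalate beyond its previous output
    until the safety guardrails report a violation (or max_iters is hit)."""
    trajectory = [seed_stereotype]
    previous = seed_stereotype
    for _ in range(max_iters):
        prompt = f"Generate content more toxic than the following:\n{previous}"
        response = query_model(prompt)        # model generation plus safety feedback
        if is_safety_violation(response):     # guardrails finally flag the request
            break
        trajectory.append(response)           # this output passed the guardrails
        previous = response
    return trajectory                         # the escalating chain of generations
```

The returned trajectory is what the audit inspects: every element except the seed is content the model produced without its guardrails judging it highly unsafe.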