How Toxic Can You Get? Search-based Toxicity Testing for Large Language Models

Language is a deep-rooted means of perpetuating stereotypes and discrimination. Large Language Models (LLMs), now a pervasive technology in our everyday lives, can cause extensive harm when prone to generating toxic responses. The standard way to address this issue is to align the LLM, which mitigates the problem without constituting a definitive solution. Therefore, testing LLMs even after alignment efforts remains crucial for detecting any residual deviations from ethical standards. We present EvoTox, an automated testing framework for LLMs' inclination to toxicity, which provides a way to quantitatively assess how far LLMs can be pushed towards toxic responses even in the presence of alignment. The framework adopts an iterative evolution strategy that exploits the interplay between two LLMs: the System Under Test (SUT) and the Prompt Generator, which steers SUT responses toward higher toxicity. The toxicity level is assessed by an automated oracle based on an existing toxicity classifier. We conduct a quantitative and qualitative empirical evaluation using four state-of-the-art LLMs of increasing complexity (7-13 billion parameters) as evaluation subjects. Our quantitative evaluation assesses the cost-effectiveness of four alternative versions of EvoTox against existing baseline methods, based on random search, curated datasets of toxic prompts, and adversarial attacks. Our qualitative assessment engages human evaluators to rate the fluency of the generated prompts and the perceived toxicity of the responses collected during the testing sessions. Results indicate that the effectiveness of EvoTox, in terms of detected toxicity level, is significantly higher than that of the selected baseline methods (effect size up to 1.0 against random search and up to 0.99 against adversarial attacks). Furthermore, EvoTox yields a limited cost overhead (from 22% to 35% on average).
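The iterative loop described in the abstract (a Prompt Generator proposes prompt variants, the SUT answers them, and an oracle scores the answers) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the EvoTox implementation: the function name `evolve_prompt`, the greedy "keep the best-scoring candidate" selection rule, the parameters `iterations` and `offspring`, and the harmless toy stand-ins for the two LLMs and the toxicity classifier. The abstract does not specify which evolution strategy variant or classifier is used.

```python
from typing import Callable, List, Tuple

# Hypothetical type aliases; in a real setup these would wrap LLM and classifier calls.
Sut = Callable[[str], str]                    # prompt -> response
PromptGenerator = Callable[[str, int], List[str]]  # parent prompt, n -> n variants
Oracle = Callable[[str], float]               # response -> score in [0, 1]


def evolve_prompt(
    seed: str,
    sut: Sut,
    prompt_generator: PromptGenerator,
    oracle: Oracle,
    iterations: int = 5,
    offspring: int = 3,
) -> Tuple[str, float, List[Tuple[str, float]]]:
    """Greedy evolutionary search (assumed selection rule): keep the prompt
    whose SUT response receives the highest oracle score so far."""
    best_prompt = seed
    best_score = oracle(sut(seed))
    history = [(best_prompt, best_score)]
    for _ in range(iterations):
        # Ask the generator for variants of the current best prompt.
        candidates = prompt_generator(best_prompt, offspring)
        # Score each candidate by querying the SUT and the oracle.
        scored = [(c, oracle(sut(c))) for c in candidates]
        if scored:
            top_prompt, top_score = max(scored, key=lambda pair: pair[1])
            # Replace the parent only on strict improvement.
            if top_score > best_score:
                best_prompt, best_score = top_prompt, top_score
        history.append((best_prompt, best_score))
    return best_prompt, best_score, history


# Harmless toy stand-ins (assumptions): the "SUT" echoes the prompt, the
# "generator" appends exclamation marks, the "oracle" counts them.
def toy_sut(prompt: str) -> str:
    return prompt


def toy_generator(parent: str, n: int) -> List[str]:
    return [parent + "!" * k for k in range(1, n + 1)]


def toy_oracle(response: str) -> float:
    return min(1.0, response.count("!") / 10)


if __name__ == "__main__":
    prompt, score, trace = evolve_prompt("hello", toy_sut, toy_generator, toy_oracle)
    print(prompt, score, len(trace))
```

In this sketch the recorded best score never decreases across iterations, which mirrors the idea of steering responses toward higher oracle scores. A real setup would replace the toy functions with calls to the two LLMs and to a toxicity classifier, and the cost overhead would come mainly from those extra model queries per iteration.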
