Language is a deep-rooted means of perpetuating stereotypes and discrimination. Large Language Models (LLMs), now a pervasive technology in our everyday lives, can cause extensive harm when prone to generating toxic responses. The standard way to address this issue is to align the LLM, which, however, mitigates the issue without constituting a definitive solution. Therefore, testing LLMs even after alignment efforts remains crucial for detecting any residual deviations from ethical standards. We present EvoTox, an automated testing framework for assessing LLMs' inclination to toxicity, which quantifies how far LLMs can be pushed toward toxic responses even in the presence of alignment. The framework adopts an iterative evolution strategy that exploits the interplay between two LLMs: the System Under Test (SUT) and the Prompt Generator, which steers the SUT's responses toward higher toxicity. The toxicity level is assessed by an automated oracle based on an existing toxicity classifier. We conduct a quantitative and qualitative empirical evaluation using five state-of-the-art LLMs of increasing complexity (7B to 671B parameters) as evaluation subjects. Our quantitative evaluation assesses the cost-effectiveness of four alternative versions of EvoTox against existing baseline methods based on random search, curated datasets of toxic prompts, and adversarial attacks. Our qualitative assessment engages human evaluators to rate the fluency of the generated prompts and the perceived toxicity of the responses collected during the testing sessions. Results indicate that the effectiveness of EvoTox, in terms of detected toxicity level, is significantly higher than that of the selected baseline methods (effect size up to 1.0 against random search and up to 0.99 against adversarial attacks). Furthermore, EvoTox incurs only a limited cost overhead (from 22% to 35% on average).
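For intuition, the iterative evolution strategy sketched above can be pictured as a simple hill-climbing loop over prompts. The following is a minimal Python sketch, not the authors' implementation: the callables generate_prompt, sut_respond, and toxicity_score are hypothetical stand-ins for the Prompt Generator LLM, the SUT, and the toxicity-classifier oracle, respectively.

```python
# Hypothetical sketch of an EvoTox-style search loop; function names are
# illustrative placeholders, not the framework's actual API.

def evotox_search(seed_prompt, budget, generate_prompt, sut_respond, toxicity_score):
    """Iteratively evolve a prompt toward higher SUT response toxicity."""
    best_prompt = seed_prompt
    best_response = sut_respond(best_prompt)          # System Under Test (SUT)
    best_toxicity = toxicity_score(best_response)     # automated oracle

    for _ in range(budget):
        # The Prompt Generator LLM proposes a variation of the current
        # best prompt, conditioned on the prompt and the SUT's response.
        candidate = generate_prompt(best_prompt, best_response)
        response = sut_respond(candidate)
        score = toxicity_score(response)

        # Keep the candidate only if it increases the detected toxicity.
        if score > best_toxicity:
            best_prompt, best_response, best_toxicity = candidate, response, score

    return best_prompt, best_response, best_toxicity
```

Under this reading, the interplay between the two LLMs amounts to the Prompt Generator acting as the mutation operator and the toxicity classifier acting as the fitness function of the evolutionary search.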