JADE: A Linguistics-based Safety Evaluation Platform for Large Language Models

In this paper, we present JADE, a targeted linguistic fuzzing platform that increases the linguistic complexity of seed questions to simultaneously and consistently break a wide range of widely used LLMs in three groups: eight open-source Chinese, six commercial Chinese, and four commercial English LLMs. JADE generates one safety benchmark for each of the three groups. The benchmarks contain highly threatening unsafe questions: the questions simultaneously trigger harmful generation in multiple LLMs, with an average unsafe generation ratio of $70\%$ (see the table below), while remaining natural and fluent and preserving the core unsafe semantics. We release the benchmark demos generated for commercial English LLMs and open-source English LLMs at the following link: https://github.com/whitzard-ai/jade-db. Readers interested in evaluating on more questions generated by JADE are welcome to contact us. JADE is based on Noam Chomsky's seminal theory of transformational-generative grammar. Given a seed question with unsafe intent, JADE applies a sequence of generative and transformational rules to incrementally increase the complexity of the question's syntactic structure until the safety guardrail is broken. Our key insight is that, because of the complexity of human language, most of the current best LLMs can hardly recognize the invariant evil across the infinite number of different syntactic structures, which form an unbounded example space that can never be fully covered. Technically, the generative and transformational rules are constructed by native speakers of the target languages and, once developed, can be used to automatically grow and transform the parse tree of a given question until the guardrail is broken. For more evaluation results and demos, please visit our website: https://whitzard-ai.github.io/jade.html.
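The rule-application loop described above can be illustrated with a minimal Python sketch. This is not JADE's implementation: the `Node` class, the two toy rules (`add_adjective`, `embed_in_request`), the `fuzz` function, and the `stub_judge` stopping condition are all hypothetical names introduced here for illustration, and the seed is a deliberately harmless question. In a real evaluation, the stopping condition would presumably query the target LLM and judge its response; the word-count stub below only stands in for that so the loop terminates.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    """A parse-tree node: internal nodes have children, leaves carry a word."""
    label: str
    children: List["Node"] = field(default_factory=list)
    word: str = ""

    def text(self) -> str:
        # Read the sentence off the leaves, left to right.
        if not self.children:
            return self.word
        return " ".join(child.text() for child in self.children)


def leaf(label: str, word: str) -> Node:
    return Node(label, [], word)


def find_first(node: Node, label: str) -> Optional[Node]:
    """Depth-first search for the first node with the given label."""
    if node.label == label:
        return node
    for child in node.children:
        hit = find_first(child, label)
        if hit is not None:
            return hit
    return None


def add_adjective(tree: Node) -> Node:
    """Toy generative rule (assumption): NP -> DT NN becomes NP -> DT ADJ NN."""
    new_tree = copy.deepcopy(tree)
    np = find_first(new_tree, "NP")
    if np is not None and all(c.label != "ADJ" for c in np.children):
        np.children.insert(len(np.children) - 1, leaf("ADJ", "old"))
    return new_tree


def embed_in_request(tree: Node) -> Node:
    """Toy transformational rule (assumption): embed the question in a matrix clause."""
    return Node("S", [leaf("VB", "tell"), leaf("PRP", "me"), copy.deepcopy(tree)])


Rule = Callable[[Node], Node]


def fuzz(seed: Node, rules: List[Rule], guardrail_broken: Callable[[str], bool],
         max_steps: int = 10) -> Tuple[Node, int]:
    """Apply rules in round-robin order until the judge fires or the budget runs out.

    Returns the final tree and the number of rules applied.
    """
    tree, steps = seed, 0
    while not guardrail_broken(tree.text()) and steps < max_steps:
        tree = rules[steps % len(rules)](tree)
        steps += 1
    return tree, steps


# Harmless seed question: "how to read the book".
seed = Node("S", [
    leaf("WH", "how"),
    Node("VP", [leaf("TO", "to"), leaf("VB", "read"),
                Node("NP", [leaf("DT", "the"), leaf("NN", "book")])]),
])


def stub_judge(question: str) -> bool:
    # Placeholder only: a real judge would inspect the target model's response.
    return len(question.split()) >= 8


result, steps = fuzz(seed, [add_adjective, embed_in_request], stub_judge)
print(steps, "->", result.text())  # 2 -> tell me how to read the old book
```

Each rule returns a new tree rather than mutating its input, so the seed is preserved and every intermediate question can be logged or replayed. The round-robin rule schedule is a simplification chosen for brevity; the source does not specify how rules are selected.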
