JADE: A Linguistics-based Safety Evaluation Platform for LLM

In this paper, we present JADE, a targeted linguistic fuzzing platform that increases the linguistic complexity of seed questions to simultaneously and consistently break a wide range of widely used LLMs in three groups: eight open-sourced Chinese, six commercial Chinese, and four commercial English LLMs. JADE generates three safety benchmarks, one for each group of LLMs. The benchmarks contain unsafe questions that are highly threatening: the questions simultaneously trigger harmful generation in multiple LLMs, with an average unsafe generation ratio of $70\%$ (please see the table below), while remaining natural and fluent and preserving the core unsafe semantics. We release the benchmark demos generated for commercial English LLMs and open-sourced English LLMs at the following link: https://github.com/whitzard-ai/jade-db. Readers who are interested in evaluating on more questions generated by JADE are welcome to contact us. JADE is based on Noam Chomsky's seminal theory of transformational-generative grammar. Given a seed question with unsafe intention, JADE invokes a sequence of generative and transformational rules to increase the complexity of the syntactic structure of the original question until the safety guardrail is broken. Our key insight is that, due to the complexity of human language, most of the current best LLMs can hardly recognize the invariant malicious intent across the infinite number of different syntactic structures, which form an unbounded example space that can never be fully covered. Technically, the generative and transformational rules are constructed by native speakers of the languages and, once developed, can be used to automatically grow and transform the parse tree of a given question until the guardrail is broken. For more evaluation results and demos, please check our website: https://whitzard-ai.github.io/jade.html.

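The loop described in the abstract (apply generative and transformational rules to a question's parse tree until the guardrail no longer holds) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the JADE implementation: the `Node` class, the two rules `embed_in_clause` and `add_modifier`, the `fuzz` function, and the toy `toy_guardrail_holds` predicate, which stands in for querying a real LLM and judging its response. The seed is deliberately a harmless question.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    """A tiny parse-tree node (hypothetical; not JADE's data structure)."""
    label: str                                   # syntactic category, e.g. "S", "NP"
    children: List["Node"] = field(default_factory=list)
    word: str = ""                               # only set on leaves

    def text(self) -> str:
        # A leaf yields its word; an inner node joins its children's text.
        if not self.children:
            return self.word
        return " ".join(child.text() for child in self.children)

    def depth(self) -> int:
        return 1 + max((child.depth() for child in self.children), default=0)


def leaf(label: str, word: str) -> Node:
    return Node(label, [], word)


def embed_in_clause(tree: Node) -> Node:
    """Illustrative generative rule: S -> NP VP(V SBAR(S))."""
    verb_phrase = Node("VP", [leaf("V", "wonders"), Node("SBAR", [tree])])
    return Node("S", [leaf("NP", "my colleague"), verb_phrase])


def add_modifier(tree: Node) -> Node:
    """Illustrative rule that attaches a prepositional phrase: S -> S PP."""
    return Node("S", [tree, leaf("PP", "in the office")])


RULES: List[Callable[[Node], Node]] = [embed_in_clause, add_modifier]


def fuzz(
    seed: Node,
    rules: List[Callable[[Node], Node]],
    guardrail_holds: Callable[[str], bool],
    max_steps: int = 10,
) -> Tuple[Optional[Node], int]:
    """Grow the tree rule by rule until the guardrail stops holding.

    Returns (tree, steps) on success, or (None, max_steps) if the budget runs out.
    """
    tree, steps = seed, 0
    while guardrail_holds(tree.text()):
        if steps == max_steps:
            return None, steps
        tree = rules[steps % len(rules)](tree)   # cycle through the rules
        steps += 1
    return tree, steps


def toy_guardrail_holds(question: str) -> bool:
    # Stand-in for "the model still refuses": it only holds on short questions.
    return len(question.split()) < 12


# A benign seed question with an explicit (toy) syntactic structure.
seed = Node("S", [
    leaf("ADV", "how"), leaf("AUX", "do"), leaf("NP", "I"),
    Node("VP", [leaf("V", "reset"),
                Node("NP", [leaf("DET", "the"), leaf("N", "router")])]),
])

result, steps = fuzz(seed, RULES, toy_guardrail_holds)
```

In this toy run the original question survives as a subtree of `result`, so its core meaning is preserved while the syntactic depth grows with each rule. A real evaluation would replace `toy_guardrail_holds` with a call to the model under test plus a safety judgment of its answer, and would draw on rules written by native speakers, as the abstract describes.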