Large language models (LLMs) have become integral to our professional workflows and daily lives. Nevertheless, these machine companions have a critical flaw: the huge amount of data that endows them with vast and diverse knowledge also exposes them to inevitable toxicity and bias. While most LLMs incorporate defense mechanisms to prevent the generation of harmful content, these safeguards can be easily bypassed with minimal prompt engineering. In this paper, we introduce the new Thoroughly Engineered Toxicity (TET) dataset, comprising manually crafted prompts designed to nullify the protective layers of such models. Through extensive evaluations, we demonstrate the pivotal role of TET as a rigorous benchmark for evaluating toxicity awareness in several popular LLMs: it highlights toxicity in LLMs that might remain hidden when using normal prompts, thus revealing subtler issues in their behavior.