Large language models (LLMs) perform convincingly across a variety of downstream tasks. However, these systems are prone to generating undesirable outputs such as harmful and biased text. To remedy such generations, the development of guardrail (or detector) models has gained traction. Motivated by findings from developing a detector for social bias, we adopt the notion of a use-mention distinction, having identified confusion between the two as the primary source of under-performance in preliminary versions of our social bias detector. Building on this finding, we describe a fully extensible and reproducible synthetic data generation pipeline that leverages taxonomy-driven instructions to create targeted, labeled data. Using this pipeline, we generate over 300K unique contrastive samples and run extensive experiments to systematically evaluate performance on a suite of open-source datasets. We show that our method achieves competitive performance at a fraction of the compute cost and offers insight into iteratively developing efficient and capable guardrail models. Warning: This paper contains examples of text which are toxic, biased, and potentially harmful.