Text-based safety classifiers are widely used for content moderation and, increasingly, to tune the behavior of generative language models, a topic of growing concern for the safety of digital assistants and chatbots. However, different policies require different classifiers, and safety policies themselves improve through iteration and adaptation. This paper introduces and evaluates methods for agile text classification, whereby classifiers are trained using small, targeted datasets that can be quickly developed for a particular policy. Experiments with seven datasets from three safety-related domains, comprising 15 annotation schemes, led to our key finding: prompt-tuning large language models, like PaLM 62B, with a labeled dataset of as few as 80 examples can achieve state-of-the-art performance. We argue that this enables a paradigm shift for text classification, especially for models supporting safer online discourse. Instead of collecting millions of examples over months or years in an attempt to create universal safety classifiers, classifiers could be tuned using small datasets, created by individuals or small organizations, tailored for specific use cases, and iterated on and adapted within a day.