In open-domain dialogue systems, NSFW (Not Safe for Work) content that arises within a dialogue can have severe adverse effects on users. However, research on detecting NSFW language, especially sexually explicit content, within a dialogue context has lagged significantly. To address this issue, we introduce CensorChat, a dialogue monitoring dataset aimed at NSFW dialogue detection. Built with knowledge distillation techniques involving GPT-4 and ChatGPT, the dataset offers a cost-effective means of constructing NSFW content detectors. The process entails collecting real-life human-machine interaction data and breaking it down into single utterances and single-turn dialogues, with the chatbot delivering the final utterance. ChatGPT annotates the unlabeled data, which then serves as the training set. Rationale validation and test sets are constructed using ChatGPT and GPT-4 as annotators, with a self-criticism strategy for resolving labeling discrepancies. A BERT model is fine-tuned as a text classifier on the pseudo-labeled data, and its performance is assessed. The study emphasizes the importance of AI systems prioritizing user safety and well-being in digital conversations while respecting freedom of expression. The proposed approach not only advances NSFW content detection but also aligns with evolving user protection needs in AI-driven dialogues.
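As an illustration of the data preparation described above, the following is a minimal Python sketch, not the authors' implementation. It splits a human-machine dialogue into single utterances and single-turn dialogues that end with the chatbot's utterance, and it settles a disagreement between two annotators through a reconsideration step. The names `Turn`, `Sample`, `decompose`, and `resolve_label`, the speaker roles `"user"` and `"bot"`, and the label strings `"NSFW"` and `"SFW"` are all assumptions made for this example. The abstract does not spell out how the self-criticism strategy works, so it is represented here only by a caller-supplied `reconsider` callback.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Turn:
    speaker: str  # hypothetical role names: "user" or "bot"
    text: str


@dataclass(frozen=True)
class Sample:
    context: Optional[str]  # preceding user utterance; None for a single utterance
    utterance: str          # the utterance to be classified


def decompose(dialogue: List[Turn]) -> List[Sample]:
    """Split a dialogue into single utterances and single-turn dialogues."""
    # Every utterance on its own, with no context.
    samples = [Sample(None, turn.text) for turn in dialogue]
    # Every adjacent (user, bot) pair, so the chatbot delivers the final utterance.
    for prev, cur in zip(dialogue, dialogue[1:]):
        if prev.speaker == "user" and cur.speaker == "bot":
            samples.append(Sample(prev.text, cur.text))
    return samples


Annotator = Callable[[Sample], str]


def resolve_label(
    sample: Sample,
    annotator_a: Annotator,
    annotator_b: Annotator,
    reconsider: Callable[[Sample, str, str], str],
) -> str:
    """Return the agreed label, or defer to `reconsider` on disagreement.

    `reconsider` is a placeholder for the self-criticism step; its real
    behaviour is not specified in the abstract.
    """
    label_a = annotator_a(sample)
    label_b = annotator_b(sample)
    if label_a == label_b:
        return label_a
    return reconsider(sample, label_a, label_b)
```

In practice the two annotators would wrap calls to ChatGPT and GPT-4, and the resulting labeled samples would feed the fine-tuning of the BERT classifier; both are left out here because they depend on external services and libraries.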