Despite the remarkable advances that large language models have brought to chatbots, maintaining a non-toxic user-AI interaction environment has become increasingly critical. However, previous efforts in toxicity detection have mostly relied on benchmarks derived from social media content, leaving the unique challenges of real-world user-AI interactions insufficiently explored. In this work, we introduce ToxicChat, a novel benchmark based on real user queries from an open-source chatbot. This benchmark contains rich, nuanced phenomena that can be difficult for current toxicity detection models to identify, revealing a significant domain difference from social media content. Our systematic evaluation of models trained on existing toxicity datasets shows their shortcomings when applied to the ToxicChat domain. Our work highlights the potentially overlooked challenges of toxicity detection in real-world user-AI conversations. In the future, ToxicChat can serve as a valuable resource for driving further advances toward a safe and healthy environment for user-AI interactions.