Dialogue safety remains a pervasive challenge in open-domain human-machine interaction. Existing approaches propose distinctive dialogue safety taxonomies and datasets for detecting explicitly harmful responses. However, these taxonomies may not be suitable for analyzing response safety in mental health support. In real-world interactions, a model response deemed acceptable in casual conversation might have a negligible positive impact on users seeking mental health support. To address these limitations, this paper develops a theoretically and factually grounded taxonomy that prioritizes the positive impact on help-seekers. Additionally, we create a benchmark corpus with fine-grained labels for each dialogue session to facilitate further research. We analyze the dataset using popular language models, including BERT-base, RoBERTa-large, and ChatGPT, to detect and understand unsafe responses within the context of mental health support. Our study reveals that ChatGPT struggles to detect safety categories with detailed safety definitions in zero- and few-shot settings, whereas the fine-tuned model proves more suitable. The dataset and findings serve as valuable benchmarks for advancing research on dialogue safety in mental health support, with significant implications for improving the design and deployment of conversational agents in real-world applications. We release our code and data here: https://github.com/qiuhuachuan/DialogueSafety.