MHSafeEval: Role-Aware Interaction-Level Evaluation of Mental Health Safety in Large Language Models

Large language models (LLMs) are increasingly explored as scalable tools for mental health counseling, yet evaluating their safety remains challenging due to the interactional and context-dependent nature of clinical harm. Existing evaluation frameworks predominantly assess isolated responses using coarse-grained taxonomies or static datasets, limiting their ability to diagnose how harms emerge and accumulate over multi-turn counseling interactions. In this work, we introduce R-MHSafe, a role-aware mental health safety taxonomy that characterizes clinically significant harm by combining the interactional role an AI counselor adopts (perpetrator, instigator, facilitator, or enabler) with clinically grounded harm categories. We then propose MHSafeEval, a closed-loop, agent-based evaluation framework that formulates safety assessment as trajectory-level discovery of harm through adversarial multi-turn interactions, guided by role-aware modeling. Using R-MHSafe and MHSafeEval, we conduct a large-scale evaluation across state-of-the-art LLMs. Our results reveal substantial role-dependent and cumulative safety failures that are systematically missed by existing static benchmarks, and show that our framework significantly improves failure-mode coverage and diagnostic granularity.

