Large language models (LLMs) are increasingly explored as scalable tools for mental health counseling, yet evaluating their safety remains challenging because clinical harm is interactional and context-dependent. Existing evaluation frameworks predominantly assess isolated responses using coarse-grained taxonomies or static datasets, which limits their ability to diagnose how harms emerge and accumulate over multi-turn counseling interactions. In this work, we introduce R-MHSafe, a role-aware mental health safety taxonomy that characterizes clinically significant harm in terms of the interactional roles an AI counselor adopts, including perpetrator, instigator, facilitator, and enabler, combined with clinically grounded harm categories. We then propose MHSafeEval, a closed-loop, agent-based evaluation framework that formulates safety assessment as trajectory-level discovery of harm through adversarial multi-turn interactions, guided by role-aware modeling. Using R-MHSafe and MHSafeEval, we conduct a large-scale evaluation of state-of-the-art LLMs. Our results reveal substantial role-dependent and cumulative safety failures that existing static benchmarks systematically miss, and show that our framework significantly improves failure-mode coverage and diagnostic granularity.