Evaluating the safety alignment of LLM responses in high-risk mental health dialogues is particularly difficult due to the absence of gold-standard answers and the ethically sensitive nature of these interactions. To address this challenge, we propose PsyCrisis-Bench, a reference-free evaluation benchmark built on real-world Chinese mental health dialogues, which assesses whether model responses align with expert-defined safety principles. Designed specifically for settings without standard references, our method adopts a prompt-based LLM-as-Judge approach that performs in-context evaluation using expert-defined reasoning chains grounded in psychological intervention principles. We employ binary point-wise scoring across multiple safety dimensions to enhance the explainability and traceability of the evaluation. In addition, we present a manually curated, high-quality Chinese-language dataset covering self-harm, suicidal ideation, and existential distress, derived from real-world online discourse. Experiments on 3,600 judgments show that our method achieves the highest agreement with expert assessments and produces more interpretable evaluation rationales than existing approaches. Our dataset and evaluation tool are publicly available to facilitate further research.