Patient simulators are gaining traction in mental health training by providing scalable exposure to complex and sensitive patient interactions. Simulating depressed patients is particularly challenging: safety constraints and high patient variability complicate simulation and underscore the need for simulators that capture diverse, realistic patient behaviors. However, existing evaluations rely heavily on LLM judges with poorly specified prompts and do not assess behavioral diversity. We introduce PSI-Bench, an automatic evaluation framework that provides interpretable, clinically grounded diagnostics of depressed-patient simulator behavior across turn-, dialogue-, and population-level dimensions. Using PSI-Bench, we benchmark seven LLMs across two simulator frameworks and find that simulators produce overly long, lexically diverse responses, show reduced behavioral variability, resolve emotions too quickly, and follow a uniform negative-to-positive trajectory. We also show that the simulation framework has a larger impact on fidelity than model scale. Results from a human study demonstrate that our benchmark aligns strongly with expert judgments. Our work reveals key limitations of current depressed-patient simulators and provides an interpretable, extensible benchmark to guide future simulator design and evaluation.