To develop a reliable AI for psychological assessment, we introduce \texttt{PsychEval}, a multi-session, multi-therapy, and highly realistic benchmark designed to address three key challenges. \textbf{1) Can we train a highly realistic AI counselor?} Realistic counseling is a longitudinal task that requires sustained memory and dynamic goal tracking. We propose a multi-session benchmark (6--10 sessions spanning three distinct stages) that demands memory continuity, adaptive reasoning, and longitudinal planning. The dataset is annotated with an extensive set of professional skills, comprising 677 meta-skills and 4,577 atomic skills. \textbf{2) How can we train a multi-therapy AI counselor?} Existing models often focus on a single therapy, yet complex cases frequently require switching flexibly among several. We construct a diverse dataset covering five therapeutic modalities (Psychodynamic, Behaviorism, CBT, Humanistic Existentialist, and Postmodernist) alongside an integrative therapy, organized under a unified three-stage clinical framework and spanning six core psychological topics. \textbf{3) How can we systematically evaluate an AI counselor?} We establish a holistic evaluation framework with 18 therapy-specific and therapy-shared metrics across Client-Level and Counselor-Level dimensions. To support this evaluation, we also construct over 2,000 diverse client profiles. Extensive experimental analysis validates the quality and clinical fidelity of our dataset. Beyond static benchmarking, \texttt{PsychEval} can also serve as a high-fidelity reinforcement learning environment that enables the self-evolutionary training of clinically responsible and adaptive AI counselors.