PsychEval: A Multi-Session and Multi-Therapy Benchmark for High-Realism AI Psychological Counselor

To develop a reliable AI for psychological assessment, we introduce \texttt{PsychEval}, a multi-session, multi-therapy, and highly realistic benchmark designed to address three key challenges. \textbf{1) Can we train a highly realistic AI counselor?} Realistic counseling is a longitudinal task requiring sustained memory and dynamic goal tracking. We propose a multi-session benchmark (spanning 6-10 sessions across three distinct stages) that demands critical capabilities such as memory continuity, adaptive reasoning, and longitudinal planning. The dataset is annotated with extensive professional skills, comprising over 677 meta-skills and 4,577 atomic skills. \textbf{2) How can we train a multi-therapy AI counselor?} While existing models often focus on a single therapy, complex cases frequently require switching flexibly among therapies. We construct a diverse dataset covering five therapeutic modalities (Psychodynamic, Behaviorism, CBT, Humanistic Existentialist, and Postmodernist) alongside an integrative therapy, with a unified three-stage clinical framework across six core psychological topics. \textbf{3) How can we systematically evaluate an AI counselor?} We establish a holistic evaluation framework with 18 therapy-specific and therapy-shared metrics across Client-Level and Counselor-Level dimensions. To support this, we also construct over 2,000 diverse client profiles. Extensive experimental analysis supports the quality and clinical fidelity of our dataset. Beyond static benchmarking, \texttt{PsychEval} can also serve as a high-fidelity reinforcement learning environment for the self-evolutionary training of clinically responsible and adaptive AI counselors.

