Large language models (LLMs) hold potential for mental healthcare applications, particularly in cognitive behavioral therapy (CBT)-based counseling, where reward models play a critical role in aligning LLMs with preferred therapeutic behaviors. However, existing reward model evaluations often fail to capture alignment effectiveness in long-horizon interventions, owing to limited coverage of process-oriented datasets and a mismatch between evaluation targets and psychological alignment objectives. To address these limitations, we present PRMB, a comprehensive benchmark tailored to evaluating reward models in multi-session CBT counseling. PRMB spans 6 sessions and 21 diverse negative scenarios, and incorporates both pairwise and Best-of-N preference evaluations. We demonstrate a positive correlation between performance on our benchmark and downstream counseling dialogue quality. Using this benchmark, we conduct an extensive analysis of state-of-the-art reward models, revealing generalization defects that previous benchmarks did not surface and highlighting the potential of generative reward models. Furthermore, we examine the effectiveness of inference-time strategies for reward model evaluation and analyze the factors that influence generative reward models. This work advances intelligent informatics for personalized healthcare by establishing a framework for reward model assessment in mental health dialogues. Evaluation code and datasets are publicly available at https://github.com/YouKenChaw/PRMB.
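To make the two evaluation protocols named above concrete, the following is a minimal sketch of pairwise preference accuracy and Best-of-N selection against a generic reward model. The scoring interface (`reward_fn`) and the data layout are assumptions for illustration, not the PRMB codebase's actual API.

```python
from typing import Callable, Iterable, Sequence, Tuple

# Hypothetical interface: a reward model reduced to a scoring
# function that maps (dialogue context, candidate response) -> float.
RewardFn = Callable[[str, str], float]


def pairwise_accuracy(
    reward_fn: RewardFn,
    pairs: Iterable[Tuple[str, str, str]],
) -> float:
    """Fraction of (context, chosen, rejected) triples in which the
    reward model scores the preferred response strictly higher."""
    pairs = list(pairs)
    correct = sum(
        reward_fn(ctx, chosen) > reward_fn(ctx, rejected)
        for ctx, chosen, rejected in pairs
    )
    return correct / len(pairs)


def best_of_n(
    reward_fn: RewardFn,
    ctx: str,
    candidates: Sequence[str],
) -> str:
    """Best-of-N selection: return the candidate the reward model
    ranks highest for the given dialogue context."""
    return max(candidates, key=lambda resp: reward_fn(ctx, resp))
```

Under this framing, pairwise evaluation checks local preference agreement, while Best-of-N evaluation measures whether the reward model can pick a good response out of a larger candidate pool, which is closer to how reward models are used at inference time.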