Exploring the Efficacy of Large Language Models in Summarizing Mental Health Counseling Sessions: A Benchmark Study

Prottay Kumar Adhikary, Aseem Srivastava, Shivani Kumar, Salam Michael Singh, Puneet Manuja, Jini K Gopinath, Vijay Krishnan, Swati Kedia, Koushik Sinha Deb, Tanmoy Chakraborty

Comprehensive session summaries enable continuity in mental health counseling and facilitate informed therapy planning. Yet manual summarization is a significant burden that diverts experts' attention from the core counseling process. This study benchmarks the effectiveness of state-of-the-art Large Language Models (LLMs) in selectively summarizing individual components of therapy sessions through aspect-based summarization. We introduce MentalCLOUDS, a counseling-component guided summarization dataset consisting of 191 counseling sessions with summaries focused on three distinct counseling components (also called counseling aspects). We then assess the capabilities of 11 state-of-the-art LLMs on the task of component-guided summarization in counseling. The generated summaries are evaluated quantitatively using standard summarization metrics and verified qualitatively by mental health professionals. Our findings demonstrate the superior performance of task-specific LLMs such as MentalLlama, Mistral, and MentalBART on standard quantitative metrics (ROUGE-1, ROUGE-2, ROUGE-L, and BERTScore) across all counseling components. Further, expert evaluation reveals that Mistral outperforms both MentalLlama and MentalBART on six parameters: affective attitude, burden, ethicality, coherence, opportunity costs, and perceived effectiveness. However, all three models share a weakness: they leave room for improvement on the opportunity costs and perceived effectiveness metrics.
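
For readers unfamiliar with the overlap metrics named above, the following is a minimal sketch of how ROUGE-N and ROUGE-L can be computed for one reference/candidate pair. It is an illustration only, not the evaluation code used in the study: it assumes lowercased whitespace tokenization, no stemming, and a plain F1 score (equal weight on precision and recall). The function names and the example sentences are hypothetical and do not come from MentalCLOUDS.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, ref_total, cand_total):
    """F1 from an overlap count and the two totals; 0.0 when undefined."""
    if overlap == 0 or ref_total == 0 or cand_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(reference, candidate, n=1):
    """ROUGE-N F1: clipped n-gram overlap between reference and candidate."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    # Counter intersection keeps the minimum count per n-gram (clipping).
    overlap = sum((ref & cand).values())
    return f1(overlap, sum(ref.values()), sum(cand.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    return f1(lcs_length(ref, cand), len(ref), len(cand))


# Hypothetical sentences, invented for illustration only.
reference = "the client reports trouble sleeping"
candidate = "the client reports poor sleep"

print(round(rouge_n(reference, candidate, 1), 3))  # 0.6
print(round(rouge_n(reference, candidate, 2), 3))  # 0.5
print(round(rouge_l(reference, candidate), 3))     # 0.6
```

In this example, three of five unigrams and two of four bigrams match, and the longest common subsequence has three tokens. Published ROUGE implementations typically add stemming and other normalization, so their scores may differ from this sketch. BERTScore is not covered here because it requires a pretrained embedding model.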
