Large Language Models (LLMs) have demonstrated potential in predicting mental health outcomes from online text, yet traditional classification methods often lack interpretability and robustness. This study evaluates structured reasoning techniques, namely Chain-of-Thought (CoT), Self-Consistency with CoT (SC-CoT), and Tree-of-Thought (ToT), to improve classification accuracy across multiple mental health datasets sourced from Reddit. We analyze reasoning-driven prompting strategies, including Zero-shot CoT and Few-shot CoT, using key performance metrics such as Balanced Accuracy, F1 score, and Sensitivity/Specificity. Our findings indicate that reasoning-enhanced techniques improve classification performance over direct prediction, particularly in complex cases. Compared to baselines such as zero-shot non-CoT prompting, fine-tuned pre-trained transformers such as BERT and MentalRoBERTa, and fine-tuned open-source LLMs such as Mental-Alpaca and Mental-Flan-T5, reasoning-driven LLMs yield notable gains on datasets such as Dreaddit (+0.52\% over M-LLM, +0.82\% over BERT) and SDCNL (+4.67\% over M-LLM, +2.17\% over BERT). However, performance declines on the Depression Severity and CSSRS prediction tasks, suggesting dataset-specific limitations, likely due to our use of a more extensive test set. Among prompting strategies, Few-shot CoT consistently outperforms the others, reinforcing the effectiveness of reasoning-driven LLMs. Nonetheless, dataset variability highlights challenges in model reliability and interpretability. This study provides a comprehensive benchmark of reasoning-based LLM techniques for mental health text classification. It offers insights into their potential for scalable clinical applications while identifying key challenges for future improvements.
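To make the contrast between the two best-performing prompting strategies concrete, the sketch below builds a Zero-shot CoT prompt (a reasoning trigger appended to the query) and a Few-shot CoT prompt (worked reasoning exemplars prepended to the query) for a binary stress-classification task in the style of Dreaddit. The prompt wording, label set, and function names are illustrative assumptions, not the paper's actual templates.

```python
# Illustrative prompt construction for Zero-shot vs. Few-shot CoT.
# Templates and labels are assumptions for demonstration only.

ZERO_SHOT_COT_TRIGGER = "Let's think step by step."

def zero_shot_cot_prompt(post: str) -> str:
    """Zero-shot CoT: state the task, then append a reasoning trigger."""
    return (
        "Classify the following Reddit post as 'stressed' or 'not stressed'.\n"
        f"Post: {post}\n"
        f"{ZERO_SHOT_COT_TRIGGER}"
    )

def few_shot_cot_prompt(post: str, exemplars: list[tuple[str, str, str]]) -> str:
    """Few-shot CoT: prepend worked examples of (post, reasoning, label),
    then leave the reasoning for the query post open for the model."""
    parts = []
    for ex_post, reasoning, label in exemplars:
        parts.append(f"Post: {ex_post}\nReasoning: {reasoning}\nLabel: {label}\n")
    parts.append(f"Post: {post}\nReasoning:")
    return "\n".join(parts)
```

In a Few-shot CoT setup, each exemplar's hand-written reasoning chain demonstrates how textual cues map to a label, which is what the abstract credits for Few-shot CoT's consistent edge over the other strategies.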