Large Language Models (LLMs) have demonstrated potential in predicting mental health outcomes from online text, yet traditional classification methods often lack interpretability and robustness. This study evaluates structured reasoning techniques, namely Chain-of-Thought (CoT), Self-Consistency with CoT (SC-CoT), and Tree-of-Thought (ToT), to improve classification accuracy across multiple mental health datasets sourced from Reddit. We analyze reasoning-driven prompting strategies, including Zero-shot CoT and Few-shot CoT, using key performance metrics such as Balanced Accuracy, F1 score, and Sensitivity/Specificity. Our findings indicate that reasoning-enhanced techniques improve classification performance over direct prediction, particularly in complex cases. Compared to baselines such as zero-shot non-CoT prompting, fine-tuned pre-trained transformers such as BERT and MentalRoBERTa, and fine-tuned open-source LLMs such as Mental-Alpaca and Mental-Flan-T5, reasoning-driven LLMs yield notable gains on datasets such as Dreaddit (+0.52\% over M-LLM, +0.82\% over BERT) and SDCNL (+4.67\% over M-LLM, +2.17\% over BERT). However, performance declines on the Depression Severity and CSSRS prediction tasks, suggesting dataset-specific limitations, likely due to our use of a more extensive test set. Among prompting strategies, Few-shot CoT consistently outperforms the others, reinforcing the effectiveness of reasoning-driven LLMs. Nonetheless, dataset variability highlights challenges in model reliability and interpretability. This study provides a comprehensive benchmark of reasoning-based LLM techniques for mental health text classification. It offers insights into their potential for scalable clinical applications while identifying key challenges for future improvements.
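To make the contrast between the two best-performing prompting strategies concrete, the sketch below builds a Zero-shot CoT prompt (a reasoning trigger appended to the query) and a Few-shot CoT prompt (worked reasoning exemplars prepended to the query) for a binary stress-classification task in the style of Dreaddit. The prompt wording, label set, and function names are illustrative assumptions, not the paper's actual templates.

```python
# Illustrative prompt construction for Zero-shot vs. Few-shot CoT.
# Templates and labels are assumptions for demonstration only.

ZERO_SHOT_COT_TRIGGER = "Let's think step by step."

def zero_shot_cot_prompt(post: str) -> str:
    """Zero-shot CoT: state the task, then append a reasoning trigger."""
    return (
        "Classify the following Reddit post as 'stressed' or 'not stressed'.\n"
        f"Post: {post}\n"
        f"{ZERO_SHOT_COT_TRIGGER}"
    )

def few_shot_cot_prompt(post: str, exemplars: list[tuple[str, str, str]]) -> str:
    """Few-shot CoT: prepend worked examples of (post, reasoning, label),
    then leave the reasoning for the query post open for the model."""
    parts = []
    for ex_post, reasoning, label in exemplars:
        parts.append(f"Post: {ex_post}\nReasoning: {reasoning}\nLabel: {label}\n")
    parts.append(f"Post: {post}\nReasoning:")
    return "\n".join(parts)
```

In a Few-shot CoT setup, each exemplar's hand-written reasoning chain demonstrates how textual cues map to a label, which is what the abstract credits for Few-shot CoT's consistent edge over the other strategies.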