Large Language Models (LLMs) have demonstrated potential in predicting mental health outcomes from online text, yet traditional classification methods often lack interpretability and robustness. This study evaluates structured reasoning techniques, namely Chain-of-Thought (CoT), Self-Consistency (SC-CoT), and Tree-of-Thought (ToT), for improving classification accuracy across multiple mental health datasets sourced from Reddit. We analyze reasoning-driven prompting strategies, including Zero-shot CoT and Few-shot CoT, using key performance metrics such as Balanced Accuracy, F1 score, and Sensitivity/Specificity. Our findings indicate that reasoning-enhanced techniques improve classification performance over direct prediction, particularly in complex cases. Compared with baselines such as zero-shot non-CoT prompting, fine-tuned pre-trained transformers such as BERT and Mental-RoBERTa, and fine-tuned open-source LLMs such as Mental Alpaca and Mental-Flan-T5, reasoning-driven LLMs yield notable gains on datasets such as Dreaddit (+0.52\% over M-LLM, +0.82\% over BERT) and SDCNL (+4.67\% over M-LLM, +2.17\% over BERT). However, performance declines on the Depression Severity and CSSRS prediction tasks suggest dataset-specific limitations, likely stemming from our use of a larger test set. Among prompting strategies, Few-shot CoT consistently outperforms the others, reinforcing the effectiveness of reasoning-driven LLMs. Nonetheless, dataset variability highlights challenges in model reliability and interpretability. This study provides a comprehensive benchmark of reasoning-based LLM techniques for mental health text classification, offering insights into their potential for scalable clinical applications while identifying key challenges for future improvement.