Social media users frequently express personal sentiments, some of which may indicate cognitive distortions or suicidal tendencies. Recognizing such signs promptly is pivotal for effective intervention. In response, we introduce two novel annotated datasets from Chinese social media, focused on cognitive distortion and suicidal risk classification. We propose a comprehensive benchmark that uses both supervised learning and large language models, especially those from the GPT series, to evaluate performance on these datasets. To assess the large language models, we employ three strategies: zero-shot prompting, few-shot prompting, and fine-tuning. We further analyze the performance of these models from a psychological perspective, shedding light on their strengths and limitations in identifying and understanding complex human emotions. Our evaluations reveal a performance gap between supervised learning and large language models, with the models often challenged by subtle distinctions between categories. While GPT-4 consistently delivered strong results, GPT-3.5 showed marked improvement in suicide risk classification after fine-tuning. This work is a pioneering evaluation of large language models on Chinese social media tasks and highlights the models' potential in psychological contexts. All datasets and code are made available.