Large language models, particularly the rapidly advancing GPT series, are attracting attention for their broad influence. Although there is strong interest in applying them to medical domains such as psychology, studies on real-world data remain scarce. Meanwhile, users on social media platforms increasingly express personal feelings; within certain topics, these feelings often appear as negative emotions and sometimes escalate to suicidal inclinations. Timely identification of such cognitive distortions and suicide risk is crucial for effective intervention and may help avert serious outcomes. We therefore study two tasks on Chinese social media platforms: suicide risk identification and cognitive distortion identification. Using supervised learning as a baseline, we evaluate and compare large language models under three strategies: zero-shot, few-shot, and fine-tuning. Our findings reveal a clear performance gap between the large language models and traditional supervised learning approaches, which we attribute primarily to the models' inability to fully grasp subtle categories. Notably, while GPT-4 outperforms its counterparts in multiple scenarios, GPT-3.5 improves significantly on suicide risk classification after fine-tuning. To our knowledge, this is the first attempt to evaluate large language models on Chinese social media tasks. The study highlights the potential of large language models in psychology and lays the groundwork for future applications in psychological research and practice.