Mental health in children and adolescents has been steadily deteriorating over the past few years. The recent advent of Large Language Models (LLMs) offers much hope for cost- and time-efficient scaling of monitoring and intervention. Yet despite the prevalence of issues such as school bullying and eating disorders, previous studies have not investigated LLM performance in this domain or on open information extraction, where the set of answers is not predetermined. We create a new dataset of Reddit posts from adolescents aged 12-19, annotated by expert psychiatrists for the following categories: TRAUMA, PRECARITY, CONDITION, SYMPTOMS, SUICIDALITY and TREATMENT, and compare the expert labels to annotations from two top-performing LLMs (GPT3.5 and GPT4). In addition, we create two synthetic datasets to assess whether LLMs perform better when annotating data as they generate it. We find GPT4 to be on par with human inter-annotator agreement, and performance on synthetic data to be substantially higher. However, the model still occasionally errs on issues of negation and factuality, and the higher performance on synthetic data is driven by the greater complexity of real data rather than by any inherent advantage.