Automatic detection of depression is a rapidly growing field of research at the intersection of psychology and machine learning. This rapid growth, however, brings heightened concerns about data privacy and data scarcity, given the sensitivity of the subject. In this paper, we propose a pipeline in which Large Language Models (LLMs) generate synthetic data to improve the performance of depression prediction models. Starting from unstructured, naturalistic text data, namely transcripts of recorded clinical interviews, we use an open-source LLM to generate synthetic data through chain-of-thought prompting. The pipeline involves two key steps: the first generates a synopsis and sentiment analysis from the original transcript and its depression score; the second generates a synthetic synopsis and sentiment analysis from the summaries produced in the first step and a new depression score. The synthetic data not only performed well on fidelity and privacy-preservation metrics, but also balanced the distribution of severity levels in the training dataset, significantly improving the model's ability to predict the intensity of a patient's depression. By leveraging LLMs to generate synthetic data that augments limited and imbalanced real-world datasets, we demonstrate a novel approach to the data scarcity and privacy concerns commonly faced in automatic depression detection, while maintaining the statistical integrity of the original dataset. This approach offers a robust framework for future mental health research and applications.
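The two-step pipeline above can be sketched as a pair of prompt builders plus an LLM call. This is a minimal illustration, not the paper's actual prompts: all function names and prompt wording here are assumptions, and `call_llm` stands in for whatever open-source LLM is used.

```python
# Hypothetical sketch of the two-step chain-of-thought pipeline.
# Prompt wording and function names are illustrative assumptions.

def build_synopsis_prompt(transcript: str, depression_score: int) -> str:
    """Step 1: ask the LLM for a synopsis and sentiment analysis
    grounded in the real transcript and its depression score."""
    return (
        "You are reviewing a clinical interview transcript.\n"
        f"The participant's depression score is {depression_score}.\n"
        "Think step by step, then write (1) a synopsis of the interview and "
        "(2) a sentiment analysis of the participant's responses.\n\n"
        f"Transcript:\n{transcript}"
    )

def build_synthesis_prompt(summary: str, new_score: int) -> str:
    """Step 2: ask the LLM to rewrite the step-1 summary as a synthetic
    synopsis/sentiment analysis conditioned on a new depression score."""
    return (
        "Below is a synopsis and sentiment analysis of a clinical interview.\n"
        f"Rewrite it as a plausible synopsis and sentiment analysis for a "
        f"participant with a depression score of {new_score}, keeping a "
        "realistic structure but no identifying details.\n\n"
        f"Summary:\n{summary}"
    )

def call_llm(prompt: str) -> str:
    """Placeholder for an open-source LLM completion call."""
    raise NotImplementedError("Plug in local LLM inference here.")
```

In a full pipeline, one would run `call_llm(build_synopsis_prompt(...))` over each real transcript, then draw new depression scores from under-represented severity bins and run `call_llm(build_synthesis_prompt(...))` to balance the training set.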