Automatic detection of depression is a rapidly growing field of research at the intersection of psychology and machine learning. This rapid growth, however, brings heightened concerns about data privacy and data scarcity, given the sensitivity of the subject. In this paper, we propose a pipeline in which Large Language Models (LLMs) generate synthetic data to improve the performance of depression prediction models. Starting from unstructured, naturalistic text data, namely transcripts of recorded clinical interviews, we use an open-source LLM to generate synthetic data through chain-of-thought prompting. The pipeline involves two key steps: the first generates a synopsis and sentiment analysis from the original transcript and its depression score; the second generates a synthetic synopsis and sentiment analysis from the summaries produced in the first step and a new depression score. The synthetic data not only performed well on fidelity and privacy-preservation metrics, but also balanced the distribution of severity levels in the training dataset, significantly improving the model's ability to predict the intensity of a patient's depression. By leveraging LLMs to generate synthetic data that augments limited and imbalanced real-world datasets, we demonstrate a novel approach to the data scarcity and privacy concerns commonly faced in automatic depression detection, while maintaining the statistical integrity of the original dataset. This approach offers a robust framework for future mental health research and applications.
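The two-step pipeline above can be sketched as a pair of prompt builders plus an LLM call. This is a minimal illustration, not the paper's actual prompts: all function names and prompt wording here are assumptions, and `call_llm` stands in for whatever open-source LLM is used.

```python
# Hypothetical sketch of the two-step chain-of-thought pipeline.
# Prompt wording and function names are illustrative assumptions.

def build_synopsis_prompt(transcript: str, depression_score: int) -> str:
    """Step 1: ask the LLM for a synopsis and sentiment analysis
    grounded in the real transcript and its depression score."""
    return (
        "You are reviewing a clinical interview transcript.\n"
        f"The participant's depression score is {depression_score}.\n"
        "Think step by step, then write (1) a synopsis of the interview and "
        "(2) a sentiment analysis of the participant's responses.\n\n"
        f"Transcript:\n{transcript}"
    )

def build_synthesis_prompt(summary: str, new_score: int) -> str:
    """Step 2: ask the LLM to rewrite the step-1 summary as a synthetic
    synopsis/sentiment analysis conditioned on a new depression score."""
    return (
        "Below is a synopsis and sentiment analysis of a clinical interview.\n"
        f"Rewrite it as a plausible synopsis and sentiment analysis for a "
        f"participant with a depression score of {new_score}, keeping a "
        "realistic structure but no identifying details.\n\n"
        f"Summary:\n{summary}"
    )

def call_llm(prompt: str) -> str:
    """Placeholder for an open-source LLM completion call."""
    raise NotImplementedError("Plug in local LLM inference here.")
```

In a full pipeline, one would run `call_llm(build_synopsis_prompt(...))` over each real transcript, then draw new depression scores from under-represented severity bins and run `call_llm(build_synthesis_prompt(...))` to balance the training set.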