Mental Health Diagnosis in the Digital Age: Harnessing Sentiment Analysis on Social Media Platforms upon Ultra-Sparse Feature Content

Amid growing global mental health concerns, particularly among vulnerable groups, natural language processing offers tremendous potential for the early detection of, and intervention in, mental disorders by analyzing people's posts and discussions on social media platforms. However, ultra-sparse training data, often caused by vast vocabularies and low-frequency words, hinders analysis accuracy. Multi-labeling and co-occurrence of symptoms may also blur the boundaries between similar or correlated disorders. To address these issues, we propose a novel semantic feature preprocessing technique with a three-fold structure: 1) mitigating feature sparsity with a weak classifier, 2) adapting the feature dimension with modulus loops, and 3) deep-mining and extending features among the contexts. With the enhanced semantic features, we train a machine learning model to predict and classify mental disorders. We use the Reddit Mental Health Dataset 2022 to examine conditions such as Anxiety, Borderline Personality Disorder (BPD), and Bipolar Disorder (BD), and present solutions to the data sparsity challenge, highlighted by a feature matrix in which 99.81% of elements are zero. After applying our preprocessing technique, the feature sparsity decreases to 85.4%. Overall, compared with seven benchmark models, our methods demonstrate significant performance improvements: 8.0% in accuracy, 0.069 in precision, 0.093 in recall, 0.102 in F1 score, and 0.059 in AUC. This research provides foundational insights for mental health prediction and monitoring, offering innovative solutions to the challenges of ultra-sparse feature data and intricate multi-label classification in mental health analysis.
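To make the sparsity figures above concrete, the following minimal sketch shows how feature sparsity (the fraction of zero entries in a document-term matrix) can be measured. It is an illustration under stated assumptions, not the preprocessing technique proposed here: the three toy posts, the whitespace tokenizer, and the helper names `bag_of_words` and `sparsity` are hypothetical and do not come from the Reddit Mental Health Dataset or the authors' pipeline.

```python
from collections import Counter


def bag_of_words(posts):
    """Build a dense document-term count matrix from whitespace-tokenized posts.

    Hypothetical helper for illustration; a real pipeline would likely use a
    proper tokenizer and a sparse matrix format.
    """
    vocab = sorted({tok for post in posts for tok in post.lower().split()})
    index = {tok: j for j, tok in enumerate(vocab)}
    matrix = []
    for post in posts:
        row = [0] * len(vocab)
        for tok, count in Counter(post.lower().split()).items():
            row[index[tok]] = count
        matrix.append(row)
    return vocab, matrix


def sparsity(matrix):
    """Return the fraction of matrix entries that are exactly zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total if total else 0.0


# Toy posts invented for this example (not taken from any dataset).
posts = [
    "i feel anxious all the time",
    "mood swings again today",
    "cannot sleep and feel anxious",
]

vocab, matrix = bag_of_words(posts)
print(f"vocabulary size: {len(vocab)}")
print(f"sparsity: {sparsity(matrix):.3f}")
```

Even on three short posts, most entries are zero because each post uses only a small part of the shared vocabulary. With a vocabulary of many thousands of low-frequency words, the same effect is consistent with sparsity levels as high as the one reported above.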
