Enhancing Depressive Post Detection in Bangla: A Comparative Study of TF-IDF, BERT and FastText Embeddings

With the massive adoption of social media, detecting users' depression through social media analytics is of significant importance, particularly for underrepresented languages such as Bangla. This study introduces a well-grounded approach to identifying depressive social media posts in Bangla by employing advanced natural language processing techniques. The dataset used in this work, annotated by domain experts, includes both depressive and non-depressive posts, ensuring high-quality data for model training and evaluation. To address the prevalent issue of class imbalance, we utilised random oversampling for the minority class, thereby enhancing the model's ability to accurately detect depressive posts. We explored various numerical representation techniques, including Term Frequency-Inverse Document Frequency (TF-IDF), Bidirectional Encoder Representations from Transformers (BERT) embedding and FastText embedding, by integrating them with a deep learning-based Convolutional Neural Network-Bidirectional Long Short-Term Memory (CNN-BiLSTM) model. The results of extensive experimentation indicate that the BERT approach performed better than the others, achieving an F1-score of 84%. This indicates that BERT, in combination with the CNN-BiLSTM architecture, effectively recognises the nuances of Bangla text relevant to depressive content. Comparative analysis with existing state-of-the-art methods demonstrates that our approach with BERT embedding performs better than the others in terms of evaluation metrics and the reliability of dataset annotations. Our research contributes significantly to the development of reliable tools for detecting depressive posts in the Bangla language. By highlighting the efficacy of different embedding techniques and deep learning models, this study paves the way for improved mental health monitoring through social media platforms.
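The random oversampling step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `random_oversample`, the `seed` parameter, and the toy posts are hypothetical, and the sketch assumes oversampling simply duplicates randomly chosen minority-class samples until every class matches the size of the largest class.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Balance classes by duplicating randomly chosen samples of the
    smaller classes (hypothetical helper, for illustration only)."""
    if not labels:
        return [], []
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class

    out_texts, out_labels = list(texts), list(labels)
    for cls, n in counts.items():
        # All original samples belonging to this class.
        pool = [t for t, y in zip(texts, labels) if y == cls]
        # Draw with replacement until the class reaches the target size.
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_texts.extend(extra)
        out_labels.extend([cls] * len(extra))

    # Shuffle so the duplicated samples are not clustered at the end.
    paired = list(zip(out_texts, out_labels))
    rng.shuffle(paired)
    return [t for t, _ in paired], [y for _, y in paired]


# Toy example: label 1 (depressive) is the minority class.
posts = ["post a", "post b", "post c", "post d", "post e"]
tags = [0, 0, 0, 0, 1]
balanced_posts, balanced_tags = random_oversample(posts, tags, seed=1)
print(Counter(balanced_tags))
```

In a setup like this, oversampling is normally applied to the training split only, so that duplicated posts cannot leak into the evaluation data.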
