PerSoMed: A Large-Scale Balanced Dataset for Persian Social Media Text Classification

This paper introduces PerSoMed, the first large-scale, class-balanced dataset for Persian social media text classification, designed to address the lack of comprehensive resources in this domain. The dataset comprises 36,000 posts across nine categories (Economic, Artistic, Sports, Political, Social, Health, Psychological, Historical, and Science & Technology), with 4,000 samples per category to ensure a balanced class distribution. We collected 60,000 raw posts from various Persian social media platforms, then applied rigorous preprocessing and a hybrid annotation scheme that combines ChatGPT-based few-shot prompting with human verification. To mitigate class imbalance in the collected data, we undersampled over-represented classes with semantic redundancy removal and augmented under-represented classes using lexical replacement and generative prompting. We benchmarked several models, including BiLSTM, XLM-RoBERTa (with LoRA and AdaLoRA adaptations), FaBERT, SBERT-based architectures, and the Persian-specific TookaBERT (Base and Large). Experimental results show that transformer-based models consistently outperform traditional neural networks, with TookaBERT-Large achieving the best performance (Precision: 0.9622, Recall: 0.9621, F1-score: 0.9621). Class-wise evaluation confirms robust performance across all categories, although Social and Political texts score slightly lower, which we attribute to their inherent ambiguity. Together, the dataset and the benchmark results provide a foundation for further work in Persian NLP, including trend analysis, social behavior modeling, and user classification. The dataset is publicly available to support future research.
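The balancing step described above can be illustrated with a minimal sketch. The abstract does not specify how semantic redundancy is measured, so the code below substitutes token-set Jaccard similarity as a simple stand-in. The function names (`jaccard`, `drop_near_duplicates`, `balance`), the `threshold` value, and the input format are assumptions made for illustration, not the authors' implementation.

```python
import random
from collections import defaultdict


def jaccard(a: set, b: set) -> float:
    """Token-set overlap in [0, 1]; a stand-in for a semantic similarity score."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(texts, threshold=0.8):
    """Keep a text only if it is below `threshold` similarity to every kept text."""
    kept, kept_tokens = [], []
    for text in texts:
        tokens = set(text.split())
        if all(jaccard(tokens, seen) < threshold for seen in kept_tokens):
            kept.append(text)
            kept_tokens.append(tokens)
    return kept


def balance(posts, per_class, threshold=0.8, seed=0):
    """Undersample each class to at most `per_class` non-redundant posts.

    `posts` is an iterable of (text, label) pairs. Classes that end up with
    fewer than `per_class` posts are returned short; in the paper such classes
    are filled by data augmentation, which this sketch does not cover.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in posts:
        by_label[label].append(text)

    balanced = {}
    for label, texts in by_label.items():
        unique = drop_near_duplicates(texts, threshold)
        rng.shuffle(unique)  # random undersampling among the survivors
        balanced[label] = unique[:per_class]
    return balanced
```

Under these assumptions, calling `balance(posts, per_class=4000)` would cap every category at the per-class size reported for the dataset. An embedding-based similarity (for example, cosine similarity over sentence embeddings) could replace `jaccard` without changing the rest of the procedure.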
