This research introduces the first large-scale, well-balanced Persian social media text classification dataset, specifically designed to address the lack of comprehensive resources in this domain. The dataset comprises 36,000 posts across nine categories (Economic, Artistic, Sports, Political, Social, Health, Psychological, Historical, and Science & Technology), each containing 4,000 samples to ensure balanced class distribution. Data collection involved 60,000 raw posts from various Persian social media platforms, followed by rigorous preprocessing and hybrid annotation combining ChatGPT-based few-shot prompting with human verification. To mitigate class imbalance, we employed undersampling with semantic redundancy removal and advanced data augmentation strategies integrating lexical replacement and generative prompting. We benchmarked several models, including BiLSTM, XLM-RoBERTa (with LoRA and AdaLoRA adaptations), FaBERT, SBERT-based architectures, and the Persian-specific TookaBERT (Base and Large). Experimental results show that transformer-based models consistently outperform traditional neural networks, with TookaBERT-Large achieving the best performance (Precision: 0.9622, Recall: 0.9621, F1-score: 0.9621). Class-wise evaluation further confirms robust performance across all categories, though social and political texts exhibited slightly lower scores due to inherent ambiguity. This research presents a new high-quality dataset and provides comprehensive evaluations of cutting-edge models, establishing a solid foundation for further developments in Persian NLP, including trend analysis, social behavior modeling, and user classification. The dataset is publicly available to support future research endeavors.
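The undersampling with semantic redundancy removal mentioned above could be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the embedding source, the cosine-similarity measure, the greedy within-class deduplication, and the 0.95 threshold are all assumptions introduced here for clarity.

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def dedup_and_undersample(samples, target_per_class, sim_threshold=0.95):
    """Illustrative sketch: drop near-duplicate embeddings within each
    class (greedy pass), then truncate each class to target_per_class.

    samples: list of (label, embedding) pairs, where embedding is a
    list of floats (e.g. from a sentence encoder -- an assumption here).
    """
    by_class = defaultdict(list)
    for label, emb in samples:
        kept = by_class[label]
        # Keep a sample only if it is not too similar to anything kept so far.
        if all(cosine(emb, other) < sim_threshold for other in kept):
            kept.append(emb)
    # Undersample: cap every class at the same target size.
    return {label: embs[:target_per_class] for label, embs in by_class.items()}
```

A near-duplicate pair (similarity above the threshold) collapses to one sample before the per-class cap is applied, so redundancy removal and balancing compose in one pass.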