Introduction: Microblogging websites have amassed rich data sources for sentiment analysis and opinion mining. However, sentiment classification on this data has frequently proven ineffective because users on these social networks tend to write short posts that lack syntactically consistent terms and representations. Low-resource languages pose additional limitations. Persian has distinctive characteristics that set its text features apart from those of English, so sentiment analysis in Persian requires its own annotated data and models. Method: This paper first constructs a user opinion dataset called ITRC-Opinion, built collaboratively and in an insourced manner. The dataset contains 60,000 informal and colloquial Persian texts from social microblogs such as Twitter and Instagram. Second, this study proposes a new architecture based on the convolutional neural network (CNN) model for more effective sentiment analysis of colloquial text in social microblog posts. The constructed dataset is used to evaluate the proposed architecture. Furthermore, several other models, such as LSTM, CNN-RNN, BiLSTM, and BiGRU, are evaluated on the dataset with different word embeddings, including fastText, GloVe, and Word2vec. Results: The results demonstrate the benefit of our dataset and the proposed model (72% accuracy), showing a meaningful improvement in sentiment classification performance.