Sentiment classification is a fundamental task in natural language processing that assigns one of three classes (positive, negative, or neutral) to free text. However, sentiment classification models are highly domain dependent: a classifier may reach reasonable accuracy in one domain but perform poorly in another, because the same word can carry different meanings in different domains. This article presents a new multi-domain sentiment analysis method for Persian and Arabic based on a cumulative weighted capsule network approach. The weighted capsule ensemble consists of a separate capsule network trained for each domain and a weighting measure called the domain belonging degree (DBD). The DBD is built from term frequency (TF) and inverse document frequency (IDF) and measures how strongly each document depends on each domain separately; this value is multiplied by the output that the corresponding capsule network produces. The sum of these products is taken as the final output and is used to determine the polarity, while the domain on which the document depends most is taken as its predicted domain. The proposed method was evaluated on the Digikala dataset and obtained acceptable accuracy relative to existing approaches: 0.89 for detecting the domain a document belongs to and 0.99 for detecting its polarity. To handle class imbalance, a cost-sensitive function was also used, which improved sentiment classification accuracy by 0.0162. On the Amazon Arabic data, the approach achieves an accuracy of 0.9695 in domain classification.
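The DBD-weighted combination described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract only says that the DBD is built from TF and IDF, so the exact scoring formula here (per-domain term frequency times a smoothed inverse domain frequency, normalised to sum to 1) is assumed, and the per-domain capsule networks are replaced by hard-coded probability vectors. The names `domain_belonging_degree`, `weighted_ensemble`, and `LABELS`, and the toy data, are hypothetical.

```python
import math
from collections import Counter

LABELS = ["negative", "neutral", "positive"]  # assumed class order


def domain_belonging_degree(doc_tokens, domain_docs):
    """Assumed DBD: TF-IDF style affinity of one document to each domain.

    domain_docs maps a domain name to a list of tokenised training documents.
    Returns a dict of weights over domains that sums to 1.
    """
    n_domains = len(domain_docs)
    # Term counts per domain, pooled over that domain's documents.
    counts = {
        d: Counter(tok for doc in docs for tok in doc)
        for d, docs in domain_docs.items()
    }
    totals = {d: sum(c.values()) for d, c in counts.items()}

    def idf(term):
        # Smoothed inverse frequency over domains (an assumption).
        df = sum(1 for c in counts.values() if term in c)
        return math.log((1 + n_domains) / (1 + df)) + 1.0

    scores = {}
    for d in domain_docs:
        scores[d] = sum(
            (counts[d][t] / totals[d]) * idf(t) for t in doc_tokens
        ) if totals[d] else 0.0

    norm = sum(scores.values())
    if norm == 0.0:
        # No overlap with any domain: fall back to uniform weights.
        return {d: 1.0 / n_domains for d in domain_docs}
    return {d: s / norm for d, s in scores.items()}


def weighted_ensemble(dbd, capsule_outputs):
    """Sum each domain model's class probabilities, weighted by its DBD."""
    n_classes = len(next(iter(capsule_outputs.values())))
    combined = [0.0] * n_classes
    for domain, probs in capsule_outputs.items():
        for k, p in enumerate(probs):
            combined[k] += dbd[domain] * p
    return combined


# Toy data: two domains and one review to classify.
domain_docs = {
    "phone": [["battery", "screen", "good"], ["screen", "bad"]],
    "book": [["story", "good"], ["plot", "bad"]],
}
review = ["battery", "screen", "good"]

# Stand-ins for the outputs of the per-domain capsule networks.
capsule_outputs = {
    "phone": [0.1, 0.2, 0.7],
    "book": [0.5, 0.3, 0.2],
}

dbd = domain_belonging_degree(review, domain_docs)
combined = weighted_ensemble(dbd, capsule_outputs)
predicted_domain = max(dbd, key=dbd.get)
predicted_polarity = LABELS[combined.index(max(combined))]
print(predicted_domain, predicted_polarity)
```

Because the weights are normalised, the combined vector remains a probability distribution whenever each per-domain output is one, so the model of the most relevant domain dominates the polarity decision without the other models being discarded entirely.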