Although Bangla is considered a low-resource language, sentiment analysis has nonetheless been studied extensively in the literature. Sentiment analysis of noisy Bangla texts, however, remains largely unexplored. In this paper, we introduce NC-SentNoB, a dataset that we annotated manually to identify ten types of noise found in an existing sentiment analysis dataset comprising around 15K noisy Bangla texts. First, given a noisy input text, we identify its noise types, treating this as a multi-label classification task. Next, we introduce baseline noise reduction methods to reduce noise before sentiment analysis. Finally, we compare the performance of fine-tuned sentiment analysis models on noisy and noise-reduced texts. The experimental results indicate that the noise reduction methods used are not satisfactory, highlighting the need for more suitable noise reduction methods in future research. The implementation and dataset presented in this paper are publicly available at https://github.com/ktoufiquee/A-Comparative-Analysis-of-Noise-Reduction-Methods-in-Sentiment-Analysis-on-Noisy-Bangla-Texts
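To make the multi-label framing in the abstract concrete, the following is a minimal sketch of how a text carrying several noise types at once might be encoded and scored. It is not the authors' implementation: the label names in `NOISE_TYPES` are hypothetical placeholders (the abstract states that there are ten noise types but does not name them), and micro-averaged F1 is assumed here only as a common multi-label metric, not as the metric used in the paper.

```python
from typing import Iterable, List

# Hypothetical placeholder names: the abstract says there are ten noise
# types but does not list them, so these are NOT the dataset's real labels.
NOISE_TYPES: List[str] = [f"noise_type_{i}" for i in range(1, 11)]


def to_multi_hot(labels: Iterable[str], classes: List[str] = NOISE_TYPES) -> List[int]:
    """Encode a set of noise labels as a 0/1 vector, one slot per class.

    A single text may carry several noise types, which is what makes the
    task multi-label rather than multi-class.
    """
    present = set(labels)
    unknown = present - set(classes)
    if unknown:
        raise ValueError(f"unknown noise types: {sorted(unknown)}")
    return [1 if c in present else 0 for c in classes]


def micro_f1(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """Micro-averaged F1 over all (text, label) decisions.

    Assumed metric for illustration only. Rows are expected to have the
    same length in y_true and y_pred.
    """
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Two annotated texts and two hypothetical model predictions.
gold = [
    to_multi_hot(["noise_type_1", "noise_type_3"]),
    to_multi_hot(["noise_type_2"]),
]
pred = [
    to_multi_hot(["noise_type_1"]),
    to_multi_hot(["noise_type_2", "noise_type_4"]),
]
score = micro_f1(gold, pred)
```

In this toy example the predictions contain two correct labels, one spurious label, and one missed label, so the micro-averaged F1 is 2/3. A real pipeline would replace the hand-written predictions with the thresholded outputs of a fine-tuned classifier.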