The widespread availability of code-mixed data can provide valuable insights into low-resource languages such as Bengali, for which few datasets exist. Sentiment analysis is a fundamental text classification task that has been studied on code-mixed data in several languages. However, no large-scale, diverse sentiment analysis dataset yet exists for code-mixed Bengali. We address this limitation by introducing BnSentMix, a sentiment analysis dataset on code-mixed Bengali consisting of 20,000 samples with 4 sentiment labels, collected from Facebook, YouTube, and e-commerce sites. We draw on diverse data sources to replicate realistic code-mixed scenarios. Additionally, we propose 14 baseline methods, including novel transformer encoders further pre-trained on code-mixed Bengali-English, achieving an overall accuracy of 69.8% and an F1 score of 69.1% on sentiment classification. Detailed analyses reveal variations in performance across sentiment labels and text types, highlighting areas for future improvement.