The analysis of consumer sentiment, as expressed through reviews, can provide a wealth of insight into the quality of a product. While sentiment analysis has been widely explored in many popular languages, comparatively little attention has been given to the Bangla language, mostly due to a lack of relevant data and limited cross-domain adaptability. To address this limitation, we present BanglaBook, a large-scale dataset of Bangla book reviews consisting of 158,065 samples classified into three broad categories: positive, negative, and neutral. We provide a detailed statistical analysis of the dataset and employ a range of machine learning models, including SVM, LSTM, and Bangla-BERT, to establish baselines. Our findings demonstrate a substantial performance advantage of pre-trained models over models that rely on manually crafted features, emphasizing the necessity for additional training resources in this domain. Additionally, we conduct an in-depth error analysis by examining sentiment unigrams, which may provide insight into common classification errors in under-resourced languages like Bangla. Our code and data are publicly available at https://github.com/mohsinulkabir14/BanglaBook.
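To make the idea of examining sentiment unigrams concrete, the following is a minimal sketch and not the authors' procedure, which the abstract does not specify. It assumes whitespace tokenization and a simple count-difference score. The function names and the five toy reviews are hypothetical and are not drawn from BanglaBook.

```python
from collections import Counter


def unigram_counts_by_label(samples):
    """Count whitespace-delimited unigrams separately for each sentiment label."""
    counts = {}
    for text, label in samples:
        counts.setdefault(label, Counter()).update(text.split())
    return counts


def distinctive_unigrams(counts, label, top_k=3):
    """Rank the unigrams of `label` by (count in label) - (count in all other labels)."""
    own = counts[label]
    others = Counter()
    for other_label, counter in counts.items():
        if other_label != label:
            others.update(counter)
    # A Counter returns 0 for unseen tokens, so unigrams unique to `label` keep their full count.
    scored = {tok: n - others[tok] for tok, n in own.items()}
    # Sort by descending score; break ties alphabetically for a deterministic order.
    return sorted(scored, key=lambda tok: (-scored[tok], tok))[:top_k]


# Hypothetical toy reviews (assumption: illustrative only, not taken from the dataset).
samples = [
    ("বই ভালো", "positive"),
    ("খুব ভালো বই", "positive"),
    ("বই খারাপ", "negative"),
    ("খুব খারাপ বই", "negative"),
    ("বই পেয়েছি", "neutral"),
]

counts = unigram_counts_by_label(samples)
print(distinctive_unigrams(counts, "positive", top_k=1))
print(distinctive_unigrams(counts, "negative", top_k=1))
```

In this sketch, a unigram such as "বই" (book) that is frequent in every class scores low, while a polarity-bearing word scores high for its class. Running the same ranking over misclassified reviews only would be one way to look for unigrams associated with errors.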