Consumer sentiment expressed in reviews offers considerable insight into the quality of a product. Although sentiment analysis has been studied extensively for many widely spoken languages, Bangla has received comparatively little attention, mostly because of a lack of relevant data and of cross-domain adaptability. To address this limitation, we present BanglaBook, a large-scale dataset of Bangla book reviews consisting of 158,065 samples classified into three broad categories: positive, negative, and neutral. We provide a detailed statistical analysis of the dataset and establish baselines with a range of machine learning models, including SVM, LSTM, and Bangla-BERT. Our findings demonstrate a substantial performance advantage of pre-trained models over models that rely on manually crafted features, emphasizing the need for additional training resources in this domain. We also conduct an in-depth error analysis by examining sentiment unigrams, which may provide insight into common classification errors in under-resourced languages like Bangla. Our code and data are publicly available at https://github.com/mohsinulkabir14/BanglaBook.
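The abstract does not spell out how the sentiment-unigram error analysis is carried out, so the following is only a minimal sketch of one plausible reading: count unigrams separately for each sentiment label and look for words that are frequent in more than one class, since such overlap could contribute to misclassification. The function names `unigram_counts_by_label` and `shared_unigrams`, the whitespace tokenization, and the toy reviews are all assumptions made for illustration; none of them come from the BanglaBook code or data.

```python
from collections import Counter, defaultdict


def unigram_counts_by_label(reviews, labels):
    """Count whitespace-delimited unigrams separately for each sentiment label.

    Whitespace splitting is an assumption; a real pipeline would likely use
    a Bangla-aware tokenizer.
    """
    counts = defaultdict(Counter)
    for text, label in zip(reviews, labels):
        counts[label].update(text.split())
    return counts


def shared_unigrams(counts, label_a, label_b, top_k=10):
    """Return the unigrams that appear among the top_k most frequent words
    of both labels (hypothetical helper for spotting overlapping vocabulary)."""
    top_a = {word for word, _ in counts[label_a].most_common(top_k)}
    top_b = {word for word, _ in counts[label_b].most_common(top_k)}
    return top_a & top_b


# Toy, made-up reviews for illustration only (not taken from BanglaBook).
reviews = [
    "বইটি খুব ভালো",
    "গল্পটি খুব ভালো",
    "বইটি খুব খারাপ",
    "গল্পটি ভালো না",
]
labels = ["positive", "positive", "negative", "negative"]

counts = unigram_counts_by_label(reviews, labels)
overlap = shared_unigrams(counts, "positive", "negative")
print(sorted(overlap))
```

In this toy data the word "ভালো" ("good") occurs in a negative review because of the negation "না", which is the kind of overlap a unigram-level view can surface but cannot resolve on its own.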