Sentiment analysis, also referred to as opinion mining, aims to extract opinions from text data. In the context of movie reviews and criticism, sentiment analysis can help predict whether a review is broadly positive or negative. Because ML models rely largely on statistical word representations, they can struggle to capture context or more abstract sentiment accurately. The objective of this paper is to classify movie reviews as expressing positive or negative sentiment. To that end, several machine learning models are compared, and Natural Language Processing (NLP) methods are employed for data preprocessing and model assessment. The IMDb dataset is used. Specifically, Naive Bayes, Logistic Regression, Support Vector Machines (SVM), LightGBM, LSTM, and transformer-based models such as RoBERTa and DistilBERT were evaluated. Across extensive testing on accuracy, precision, recall, F1-score, and ROC-AUC, RoBERTa outperformed all the other models, reaching an accuracy of 93.02%. A soft voting ensemble that combined all the models also improved classification performance, suggesting that model ensembling works well for sentiment analysis.
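As a rough illustration of the soft voting idea mentioned above, the following minimal sketch averages the positive-class probabilities produced by several classifiers and thresholds the result. It is not the paper's implementation: the function names `soft_vote` and `predict_labels`, the equal default weights, the 0.5 threshold, and the example probabilities are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence


def soft_vote(
    model_probs: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Return the weighted mean positive-class probability per review.

    model_probs[m][i] is model m's probability that review i is positive.
    Equal weights are assumed when none are given (an illustrative choice).
    """
    if not model_probs:
        raise ValueError("at least one model is required")
    n_samples = len(model_probs[0])
    if any(len(probs) != n_samples for probs in model_probs):
        raise ValueError("all models must score the same number of reviews")
    if weights is None:
        weights = [1.0] * len(model_probs)
    if len(weights) != len(model_probs):
        raise ValueError("one weight per model is required")
    total = float(sum(weights))
    return [
        sum(w * probs[i] for w, probs in zip(weights, model_probs)) / total
        for i in range(n_samples)
    ]


def predict_labels(avg_probs: Sequence[float], threshold: float = 0.5) -> List[str]:
    """Map averaged probabilities to sentiment labels (0.5 threshold assumed)."""
    return ["positive" if p >= threshold else "negative" for p in avg_probs]


# Hypothetical probabilities from three models for three reviews.
model_a = [0.90, 0.40, 0.20]
model_b = [0.80, 0.95, 0.10]
model_c = [0.70, 0.35, 0.30]

averaged = soft_vote([model_a, model_b, model_c])
labels = predict_labels(averaged)
```

In this made-up example the second review is labelled positive, because one very confident model outweighs two mildly negative ones; a simple majority vote over the three models would have labelled it negative. That sensitivity to confidence is the usual motivation for soft over hard voting.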