Evaluating the Effectiveness of Pre-trained Language Models in Predicting the Helpfulness of Online Product Reviews

Businesses and customers can gain valuable information from product reviews. The sheer number of reviews often necessitates ranking them by their potential helpfulness. However, only a few reviews ever receive any helpfulness votes on online marketplaces. Sorting all reviews by the few existing votes can cause helpful reviews to go unnoticed because readers have a limited attention span. Review helpfulness prediction matters even more when review volumes are high and when reviews are newly written or products are newly launched. In this work, we compare the RoBERTa and XLM-R language models for predicting the helpfulness of online product reviews. Relative to the literature, our contributions are an extensive investigation of the efficacy of state-of-the-art language models (both monolingual and multilingual) against a robust baseline, the use of ranking metrics when assessing these approaches, and the first assessment of multilingual models for this task. We use the Amazon review dataset for our experiments. In our study of several product categories, multilingual and monolingual pre-trained language models outperform a baseline that uses a random forest with handcrafted features by as much as 23% in RMSE. Pre-trained language models reduce the need for complex text feature engineering. However, our results suggest that pre-trained multilingual models may not be well suited to fine-tuning on only one language. We assess the performance of the language models with and without additional features. Our results show that including additional features, such as the product rating given by the reviewer, can further improve the predictive methods.
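To illustrate why an error metric and a ranking metric can be assessed side by side, the following is a minimal sketch rather than the evaluation code used in this work. It assumes that helpfulness is a real-valued score per review and uses NDCG@k as an example ranking metric; the abstract does not name the ranking metrics, so that choice, the function names `rmse`, `dcg`, and `ndcg_at_k`, and the toy scores are all assumptions made for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted helpfulness scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def dcg(relevances, k):
    """Discounted cumulative gain over the first k items of a ranked list."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(y_true, y_pred, k):
    """NDCG@k: rank reviews by predicted score, gain is the true score."""
    # Indices of reviews sorted by predicted helpfulness, highest first.
    order = sorted(range(len(y_pred)), key=lambda i: y_pred[i], reverse=True)
    ranked_gains = [y_true[i] for i in order]
    ideal_gains = sorted(y_true, reverse=True)
    ideal = dcg(ideal_gains, k)
    return dcg(ranked_gains, k) / ideal if ideal > 0 else 0.0


# Toy data (hypothetical): true helpfulness scores and two sets of predictions.
true_scores = [0.9, 0.1, 0.5, 0.0]
good_preds = [0.8, 0.2, 0.4, 0.1]  # small errors, correct ordering
bad_preds = [0.1, 0.8, 0.4, 0.9]   # ordering is largely reversed

if __name__ == "__main__":
    print("RMSE (good):", rmse(true_scores, good_preds))
    print("NDCG@3 (good):", ndcg_at_k(true_scores, good_preds, 3))
    print("NDCG@3 (bad):", ndcg_at_k(true_scores, bad_preds, 3))
```

RMSE measures how far each predicted score is from the true score, whereas NDCG@k only rewards placing the most helpful reviews near the top, which is closer to what a reader browsing a sorted review list experiences.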
