Sentiment classification is a fundamental task in Natural Language Processing with important academic and commercial applications. It aims to automatically predict the degree of sentiment present in text that contains opinions and some level of subjectivity, such as product reviews, movie reviews, or tweets. The task is difficult in part because different domains of text contain different words and expressions. The difficulty increases when the text is written in a language other than English, owing to the lack of databases and resources. As a consequence, cross-domain and cross-language techniques are often applied to this task to improve results. In this work we study the ability of a classification system trained on a large database of product reviews to generalize to different Spanish domains. Reviews were collected from the MercadoLibre website in seven Latin American countries, allowing the creation of a large and balanced dataset. Results suggest that generalization across domains is feasible, though very challenging, when the model is trained with these product reviews, and that it can be improved by pre-training and fine-tuning the classification model.