Sentiment analysis is a well-established natural language processing task, and sentiment polarity classification is one of its most popular and representative subtasks. However, despite the success of pre-trained language models in this area, they often fall short of capturing the broader complexities of sentiment analysis. To address this issue, we propose a new task called Sentiment and Opinion Understanding of Language (SOUL). SOUL aims to evaluate sentiment understanding through two subtasks: Review Comprehension (RC) and Justification Generation (JG). RC asks a model to validate statements about subjective information against a review text, while JG requires the model to explain its sentiment predictions. To enable comprehensive evaluation, we annotate a new dataset comprising 15,028 statements from 3,638 reviews. Experimental results indicate that SOUL is a challenging task for both small and large language models, which trail human performance by up to 27%. Furthermore, evaluations conducted with both human experts and GPT-4 highlight the limitations of the small language model in generating reasoning-based justifications. These findings emphasize the need for further advances in sentiment analysis to address its complexities. The new dataset and code are available at https://github.com/DAMO-NLP-SG/SOUL.