Opinion summarization sets itself apart from other types of summarization tasks due to its distinctive focus on aspects and sentiments. Although certain automated evaluation methods like ROUGE have gained popularity, we have found them to be unreliable measures for assessing the quality of opinion summaries. In this paper, we present OpinSummEval, a dataset comprising human judgments and outputs from 14 opinion summarization models. We further explore the correlation between 24 automatic metrics and human ratings across four dimensions. Our findings indicate that metrics based on neural networks generally outperform non-neural ones. However, even metrics built on powerful backbones, such as BART and GPT-3/3.5, do not consistently correlate well across all dimensions, highlighting the need for advancements in automated evaluation methods for opinion summarization. The code and data are publicly available at https://github.com/A-Chicharito-S/OpinSummEval/tree/main.
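As a rough illustration of what correlating an automatic metric with human ratings involves, the sketch below computes a rank correlation between per-summary metric scores and human scores along one dimension. The abstract does not say which correlation coefficient is used, so Kendall's tau-b is an assumption here. The function name, the dimension labels, and all scores are hypothetical and are not taken from OpinSummEval.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length score lists (tie-aware)."""
    if len(xs) != len(ys):
        raise ValueError("score lists must have the same length")
    concordant = discordant = only_x_tied = only_y_tied = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both lists: contributes to neither factor
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    # Pairs not tied in x, times pairs not tied in y.
    denom = sqrt((untied + only_y_tied) * (untied + only_x_tied))
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical human ratings for five summaries on two placeholder dimensions.
human = {
    "dimension_a": [1, 2, 3, 4, 5],
    "dimension_b": [5, 3, 4, 1, 2],
}
# Hypothetical scores from one automatic metric for the same five summaries.
metric_scores = [0.10, 0.25, 0.40, 0.55, 0.90]

for dimension, ratings in human.items():
    tau = kendall_tau_b(metric_scores, ratings)
    print(f"{dimension}: tau-b = {tau:.3f}")
```

A metric that ranks summaries the way annotators do on one dimension can rank them very differently on another, which is the kind of inconsistency the abstract reports for even the strongest metrics.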