Reliable automatic evaluation of summarization systems is challenging due to the multifaceted and subjective nature of the task. This is especially true for languages other than English, where human evaluations are scarce. In this work, we introduce SEAHORSE, a dataset for multilingual, multifaceted summarization evaluation. SEAHORSE consists of 96K summaries with human ratings along 6 dimensions of text quality (comprehensibility, repetition, grammar, attribution, main ideas, and conciseness), covering 6 languages, 9 systems, and 4 datasets. As a result of its size and scope, SEAHORSE can serve both as a benchmark for evaluating learnt metrics and as a large-scale resource for training such metrics. We show that metrics trained with SEAHORSE achieve strong performance on the out-of-domain meta-evaluation benchmarks TRUE (Honovich et al., 2022) and mFACE (Aharoni et al., 2022). We make the SEAHORSE dataset and metrics publicly available for future research on multilingual and multifaceted summarization evaluation.