Reliable automatic evaluation of summarization systems is challenging due to the multifaceted and subjective nature of the task. This is especially the case for languages other than English, where human evaluations are scarce. In this work, we introduce SEAHORSE, a dataset for multilingual, multifaceted summarization evaluation. SEAHORSE consists of 96K summaries with human ratings along 6 quality dimensions (comprehensibility, repetition, grammar, attribution, main ideas, and conciseness), covering 6 languages, 9 systems, and 4 datasets. As a result of its size and scope, SEAHORSE can serve both as a benchmark to evaluate learnt metrics and as a large-scale resource for training such metrics. We show that metrics trained with SEAHORSE achieve strong performance on the out-of-domain meta-evaluation benchmarks TRUE (Honovich et al., 2022) and mFACE (Aharoni et al., 2022). We make SEAHORSE publicly available for future research on multilingual and multifaceted summarization evaluation.