SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation

Elizabeth Clark, Shruti Rijhwani, Sebastian Gehrmann, Joshua Maynez, Roee Aharoni, Vitaly Nikolaev, Thibault Sellam, Aditya Siddhant, Dipanjan Das, Ankur P. Parikh

Reliable automatic evaluation of summarization systems is challenging due to the multifaceted and subjective nature of the task. This is especially the case for languages other than English, where human evaluations are scarce. In this work, we introduce SEAHORSE, a dataset for multilingual, multifaceted summarization evaluation. SEAHORSE consists of 96K summaries with human ratings along 6 dimensions of text quality: comprehensibility, repetition, grammar, attribution, main ideas, and conciseness, covering 6 languages, 9 systems and 4 datasets. Owing to its size and scope, SEAHORSE can serve both as a benchmark for evaluating learnt metrics and as a large-scale resource for training them. We show that metrics trained with SEAHORSE achieve strong performance on the out-of-domain meta-evaluation benchmarks TRUE (Honovich et al., 2022) and mFACE (Aharoni et al., 2022). We make the SEAHORSE dataset and metrics publicly available for future research on multilingual and multifaceted summarization evaluation.
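To make the "benchmark for learnt metrics" use concrete, the following is a minimal sketch of how a metric's scores might be compared against human ratings for one quality dimension. Several things here are assumptions of the sketch and not taken from the abstract: the record layout (`human` and `metric` dictionaries), the 0/1 encoding of human ratings, the function names `pearson` and `meta_evaluate`, and the choice of Pearson correlation as the agreement measure. The toy records are made up and are not SEAHORSE data.

```python
from math import sqrt

# The six quality dimensions named in the abstract.
DIMENSIONS = [
    "comprehensibility", "repetition", "grammar",
    "attribution", "main_ideas", "conciseness",
]


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences of at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def meta_evaluate(records, dimension):
    """Correlate metric scores with human ratings for one dimension.

    Each record is a hypothetical dict of the form
    {"human": {dimension: 0 or 1}, "metric": {dimension: float}}.
    """
    if dimension not in DIMENSIONS:
        raise ValueError(f"unknown dimension: {dimension}")
    human = [r["human"][dimension] for r in records]
    metric = [r["metric"][dimension] for r in records]
    return pearson(human, metric)


# Made-up toy records, purely for illustration.
toy_records = [
    {"human": {"attribution": 1}, "metric": {"attribution": 0.9}},
    {"human": {"attribution": 0}, "metric": {"attribution": 0.2}},
    {"human": {"attribution": 1}, "metric": {"attribution": 0.7}},
    {"human": {"attribution": 0}, "metric": {"attribution": 0.4}},
]

if __name__ == "__main__":
    print(meta_evaluate(toy_records, "attribution"))
```

A higher correlation would suggest that the metric ranks summaries more like the human raters do on that dimension. Repeating the call per dimension (and per language) gives the kind of multifaceted comparison the abstract describes.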


