Automatic evaluation metrics have facilitated the rapid development of automatic summarization methods by providing instant and fair assessments of summary quality. Most of these metrics were developed for the general domain, especially news and meeting notes, or for other language-generation tasks. Nevertheless, they are routinely applied to evaluate summarization systems in other domains, such as biomedical question summarization. To better understand whether commonly used evaluation metrics are capable of evaluating automatic summarization in the biomedical domain, we conduct human evaluations of summarization quality along four different aspects of a biomedical question summarization task. Based on the human judgments, we identify noteworthy characteristics of both current automatic metrics and current summarization systems. We also release a dataset of our human annotations to aid research on summarization evaluation metrics in the biomedical domain.