Despite significant advances in keyphrase extraction and keyphrase generation methods, the predominant evaluation approach relies only on exact matching against human references and disregards reference-free attributes. This scheme fails to recognize systems that generate keyphrases that are semantically equivalent to the references or that have practical utility. To better understand the strengths and weaknesses of different keyphrase systems, we propose a comprehensive evaluation framework consisting of six critical dimensions: naturalness, faithfulness, saliency, coverage, diversity, and utility. For each dimension, we discuss the desiderata and design semantic-based metrics that align with the evaluation objectives. Rigorous meta-evaluation studies demonstrate that our evaluation strategy correlates better with human preferences than a range of previously used metrics. Using this framework, we re-evaluate 18 keyphrase systems and further discover that (1) the best model differs across dimensions, with pre-trained language models achieving the best results in most dimensions; (2) utility in downstream tasks does not always correlate well with reference-based metrics; and (3) large language models exhibit strong performance in reference-free evaluation.
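To make the limitation of exact matching concrete, the following is a minimal sketch of an exact-match F1 score over keyphrase sets. The function name `exact_match_f1`, the normalization (lowercasing and whitespace collapsing only, with no stemming), and the example phrases are illustrative assumptions, not the metric implementation or data used in this work.

```python
def exact_match_f1(predicted, references):
    """Exact-match F1 between two keyphrase lists (hypothetical helper).

    Assumption: phrases are compared after lowercasing and collapsing
    whitespace; no stemming or semantic matching is applied.
    """
    def normalize(phrase):
        return " ".join(phrase.lower().split())

    pred = {normalize(p) for p in predicted}
    ref = {normalize(r) for r in references}
    matches = len(pred & ref)
    if matches == 0:
        return 0.0
    precision = matches / len(pred)
    recall = matches / len(ref)
    return 2 * precision * recall / (precision + recall)


# Made-up example: system_b paraphrases every reference keyphrase.
references = ["neural network", "keyphrase generation"]
system_a = ["Neural Network", "keyphrase generation"]
system_b = ["neural net", "generating keyphrases"]

score_a = exact_match_f1(system_a, references)
score_b = exact_match_f1(system_b, references)
print(score_a, score_b)
```

In this toy case the verbatim system receives full credit while the paraphrasing system receives none, even though its keyphrases carry the same meaning. This is the kind of gap that semantic-based metrics are meant to close.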