Despite significant advances in keyphrase extraction and generation methods, the predominant evaluation approach still relies on exact matching against human references. This scheme fails to credit systems that generate keyphrases semantically equivalent to the references, or diverse keyphrases that carry practical utility. To better assess the capabilities of keyphrase systems, we propose KPEval, a comprehensive evaluation framework covering four critical aspects: reference agreement, faithfulness, diversity, and utility. For each aspect, we design semantic-based metrics that reflect the evaluation objectives. Meta-evaluation studies demonstrate that our evaluation strategy correlates better with human preferences than a range of previously proposed metrics. Using KPEval, we re-evaluate 23 keyphrase systems and find that (1) established model comparison results have blind spots, especially under reference-free evaluation; (2) large language models are underestimated by prior evaluation work; and (3) no single model excels in all aspects.