Despite significant advances in keyphrase extraction and keyphrase generation methods, the predominant evaluation approach relies solely on exact matching with human references and disregards reference-free attributes. This scheme fails to recognize systems that generate keyphrases semantically equivalent to the references, or diverse keyphrases that carry practical utility. To better assess the capability of keyphrase systems, we propose KPEval, a comprehensive evaluation framework consisting of four critical dimensions: saliency, faithfulness, diversity, and utility. For each dimension, we design semantic-based metrics that align with the evaluation objectives. Meta-evaluation studies demonstrate that our evaluation strategy correlates better with human preferences than a range of previously used metrics. Using this framework, we re-evaluate 20 keyphrase systems and further discover that (1) the best model differs depending on the evaluation dimension; (2) utility in downstream tasks does not always correlate with reference-based metrics; and (3) large language models like GPT-3.5 exhibit strong performance under reference-free evaluation.
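To make the limitation of exact matching concrete, the following is a minimal sketch of an exact-match F1 score, not the KPEval implementation. The function names `normalize` and `exact_match_f1` and the example phrases are hypothetical, and normalization here is assumed to be lowercasing plus whitespace collapsing only (exact-match evaluations often also apply stemming, which is omitted).

```python
def normalize(phrase: str) -> str:
    """Lowercase and collapse whitespace (assumed normalization; no stemming)."""
    return " ".join(phrase.lower().split())


def exact_match_f1(predicted: list[str], references: list[str]) -> float:
    """F1 over the sets of normalized predicted and reference keyphrases."""
    pred = {normalize(p) for p in predicted}
    ref = {normalize(r) for r in references}
    matched = len(pred & ref)
    if matched == 0:
        return 0.0
    precision = matched / len(pred)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example phrases, chosen only for illustration.
references = ["neural network", "keyphrase generation"]
verbatim = ["Neural Network", "keyphrase generation"]
paraphrased = ["neural net", "generating keyphrases"]

score_verbatim = exact_match_f1(verbatim, references)
score_paraphrased = exact_match_f1(paraphrased, references)
print(score_verbatim, score_paraphrased)
```

Under these assumptions, the paraphrased predictions receive no credit even though they are semantically equivalent to the references, which is the failure mode that motivates semantic-based metrics.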