Conventional automatic evaluation metrics for natural language generation (NLG), such as BLEU and ROUGE, are based on measuring the n-gram overlap between generated and reference text. Such surface-level metrics may be insufficient for more complex tasks such as question generation (QG), which requires generating questions that are answerable by the reference answers. Developing a more sophisticated automatic evaluation metric thus remains a pressing problem in QG research. This work proposes PMAN (Prompting-based Metric on ANswerability), a novel automatic evaluation metric that assesses whether the questions generated for a QG task are answerable by the reference answers. Extensive experiments demonstrate that its evaluation results are reliable and align with human judgments. We further apply our metric to evaluate the performance of QG models, showing that it complements conventional metrics. Finally, our implementation of a GPT-based QG model achieves state-of-the-art performance in generating answerable questions.
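To make the idea concrete, below is a minimal sketch of how a prompting-based answerability metric of this kind could be implemented, assuming an OpenAI chat model serves as the judge. The prompt wording, the PASS/FAIL protocol, and the function names (`answerability_score`, `pman`) are illustrative assumptions, not the paper's exact setup.

```python
# A minimal sketch of a PMAN-style answerability check (assumed setup,
# not the paper's exact prompt or scoring protocol).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def answerability_score(question: str, reference_answer: str,
                        model: str = "gpt-3.5-turbo") -> int:
    """Return 1 if the judge model deems `question` answerable by
    `reference_answer`, else 0."""
    prompt = (
        "You will be given a question and a reference answer.\n"
        "Decide whether the question can be answered by the reference "
        "answer. Reply with exactly PASS or FAIL.\n\n"
        f"Question: {question}\n"
        f"Reference answer: {reference_answer}"
    )
    reply = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic judging for reproducibility
        messages=[{"role": "user", "content": prompt}],
    )
    verdict = reply.choices[0].message.content.strip().upper()
    return 1 if verdict.startswith("PASS") else 0


def pman(pairs: list[tuple[str, str]]) -> float:
    """Corpus-level score: fraction of (question, reference answer)
    pairs judged answerable."""
    return sum(answerability_score(q, a) for q, a in pairs) / len(pairs)
```

Setting the temperature to 0 keeps the judge's verdicts as stable as possible across runs; whether a binary PASS/FAIL verdict or a graded score is used is a design choice this sketch does not settle.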