Conventional automatic evaluation metrics for natural language generation (NLG), such as BLEU and ROUGE, measure the n-gram overlap between generated and reference text. Such simple metrics may be insufficient for more complex tasks like question generation (QG), which requires generating questions that are answerable by the reference answers. Developing a more sophisticated automatic evaluation metric therefore remains an urgent problem in QG research. This work proposes the Prompting-based Metric on ANswerability (PMAN), a novel automatic evaluation metric that assesses whether generated questions are answerable by the reference answers in QG tasks. Extensive experiments demonstrate that its evaluation results are reliable and align with human evaluations. We further apply our metric to evaluate QG models and show that it complements conventional metrics. Our implementation of a ChatGPT-based QG model achieves state-of-the-art (SOTA) performance in generating answerable questions.
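To illustrate why n-gram overlap alone can miss answerability, the following minimal sketch computes a clipped n-gram precision, the building block of BLEU-style metrics. It is not the PMAN implementation and not a full BLEU or ROUGE. The function names and the example questions are hypothetical and chosen only for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(candidate, reference, n=1):
    """Clipped n-gram precision: share of candidate n-grams also found in the reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Clip each candidate n-gram count by its count in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


# Hypothetical example; the reference answer is assumed to be "Jane Austen".
reference_question = "who wrote the novel pride and prejudice ?"
paraphrase = "which author penned pride and prejudice ?"  # still answerable by "Jane Austen"
near_copy = "who read the novel pride and prejudice ?"    # no longer answerable by "Jane Austen"

paraphrase_score = ngram_precision(paraphrase, reference_question)
near_copy_score = ngram_precision(near_copy, reference_question)
print(f"paraphrase: {paraphrase_score:.3f}, near copy: {near_copy_score:.3f}")
```

In this toy case the near copy scores higher than the paraphrase even though only the paraphrase can be answered by the assumed reference answer, which is the kind of gap an answerability-oriented metric is meant to close.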