Existing metrics for evaluating the quality of automatically generated questions, such as BLEU, ROUGE, BERTScore, and BLEURT, compare the reference and predicted questions, giving a high score when the candidate has considerable lexical overlap or semantic similarity with the reference. This approach has two major shortcomings. First, it requires expensive human-provided reference questions. Second, it penalises valid questions that may not have high lexical or semantic similarity to the reference questions. In this paper, we propose a new metric, RQUGE, based on the answerability of the candidate question given the context. The metric consists of a question-answering module and a span scorer module, both using pre-trained models from existing literature, so it can be used without any further training. We demonstrate that RQUGE has a higher correlation with human judgment without relying on a reference question. Additionally, RQUGE is shown to be more robust to several adversarial corruptions. Furthermore, we illustrate that the performance of QA models on out-of-domain datasets can be significantly improved by fine-tuning on synthetic data generated by a question generation model and re-ranked by RQUGE.
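The two-module design and the re-ranking step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that the span scorer compares the answer predicted by the QA module against a target answer span, a detail the abstract does not spell out. The names `reference_free_score`, `rerank`, `toy_qa_model`, and `toy_span_scorer` are hypothetical, and the two toy functions are simple word-overlap stand-ins for the pre-trained models, used only so that the sketch runs without any downloads.

```python
import re
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical interfaces standing in for the pre-trained modules.
QAModel = Callable[[str, str], str]              # (question, context) -> predicted answer
SpanScorer = Callable[[str, str, str], float]    # (question, target answer, predicted answer) -> score


def tokens(text: str) -> List[str]:
    """Lowercase word tokens, punctuation removed."""
    return re.findall(r"\w+", text.lower())


def toy_qa_model(question: str, context: str) -> str:
    """Toy stand-in for a QA model: return the context sentence
    that shares the most distinct words with the question."""
    sentences = re.split(r"(?<=[.!?])\s+", context.strip())
    question_words = set(tokens(question))
    return max(sentences, key=lambda s: len(question_words & set(tokens(s))))


def toy_span_scorer(question: str, target: str, predicted: str) -> float:
    """Toy stand-in for a span scorer: token-level F1 in [0, 1]
    between the predicted answer and the target answer."""
    pred, gold = Counter(tokens(predicted)), Counter(tokens(target))
    common = sum((pred & gold).values())
    if common == 0:
        return 0.0
    precision = common / sum(pred.values())
    recall = common / sum(gold.values())
    return 2 * precision * recall / (precision + recall)


def reference_free_score(question: str, context: str, target: str,
                         qa_model: QAModel, span_scorer: SpanScorer) -> float:
    """Score a candidate question by how well it can be answered from
    the context. No reference question is involved at any point."""
    predicted = qa_model(question, context)
    return span_scorer(question, target, predicted)


def rerank(candidates: List[str], context: str, target: str,
           qa_model: QAModel, span_scorer: SpanScorer) -> List[Tuple[float, str]]:
    """Sort candidate questions from highest to lowest score."""
    scored = [(reference_free_score(q, context, target, qa_model, span_scorer), q)
              for q in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    context = "Marie Curie was born in Warsaw. She won the Nobel Prize in 1903."
    candidates = ["When did she win the Nobel Prize?", "Where was Marie Curie born?"]
    for score, question in rerank(candidates, context, "Warsaw",
                                  toy_qa_model, toy_span_scorer):
        print(f"{score:.3f}  {question}")
```

Passing the two modules in as callables keeps the scoring logic independent of any particular model. In practice the toy functions would be replaced by real pre-trained QA and span-scoring models, and the ranked list could be truncated to select synthetic training examples.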