Reference-based metrics such as BLEU and BERTScore are widely used to evaluate question generation (QG). In this study, on QG benchmarks such as SQuAD and HotpotQA, we find that using human-written references does not guarantee the effectiveness of reference-based metrics. Most QG benchmarks have only one reference; we replicate the annotation process and collect another reference. A good metric is expected to grade a human-validated question no worse than generated questions. However, the results of reference-based metrics on our newly collected reference fail to meet this expectation, which calls the metrics themselves into question. We propose a reference-free metric consisting of multi-dimensional criteria such as naturalness, answerability, and complexity, computed using large language models. These criteria are not constrained to the syntax or semantics of a single reference question, and the metric does not require a diverse set of references. Experiments show that our metric accurately distinguishes high-quality questions from flawed ones and achieves state-of-the-art alignment with human judgment.
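The sanity check described in the abstract (a human-validated question should score no worse than generated questions) can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the study: `token_f1` is a toy unigram-overlap score standing in for a reference-based metric (it is not BLEU or BERTScore), the function names are hypothetical, and the example questions are invented.

```python
from collections import Counter


def token_f1(candidate: str, reference: str) -> float:
    """Toy unigram-overlap F1; a stand-in for a reference-based metric."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def passes_sanity_check(metric, original_ref, new_human_ref, generated):
    """True if the human-validated question scores no worse than every generated one."""
    human_score = metric(new_human_ref, original_ref)
    return all(human_score >= metric(g, original_ref) for g in generated)


# Hypothetical example data, invented for illustration only.
original_ref = "who wrote the novel pride and prejudice"
new_human_ref = "which author is pride and prejudice by"
generated = ["who wrote the novel pride and prejudice in 1813"]

print(passes_sanity_check(token_f1, original_ref, new_human_ref, generated))
```

In this toy case the check fails: the second human question is a valid paraphrase but shares few tokens with the original reference, so an overlap-based score ranks it below a generated question that copies the reference wording.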