Reference-based metrics such as BLEU and BERTScore are widely used to evaluate question generation (QG). In this study, on QG benchmarks such as SQuAD and HotpotQA, we find that human-written references do not guarantee the effectiveness of reference-based metrics. Most QG benchmarks provide only one reference; we replicate the annotation process and collect a second. A good metric is expected to score a human-validated question no worse than generated questions, yet the scores that reference-based metrics assign on our newly collected reference contradict this expectation. We propose a reference-free metric that uses large language models to assess multi-dimensional criteria such as naturalness, answerability, and complexity. These criteria are not constrained by the syntax or semantics of a single reference question, so the metric does not require a diverse set of references. Experiments show that our metric accurately distinguishes high-quality questions from flawed ones and achieves state-of-the-art alignment with human judgment.
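A reference-free, multi-dimensional LLM-judge metric of the kind described above could be sketched as follows. The dimension prompts, the 1–5 rating scale, and the `ask_llm` hook are illustrative assumptions for this sketch, not the actual prompts or scale used in the study.

```python
# Minimal sketch of a reference-free QG metric: score a question along several
# criteria with an LLM judge, no gold reference needed. Prompts, the 1-5 scale,
# and the `ask_llm` callable are assumptions, not the paper's exact design.
import re
from statistics import mean

DIMENSIONS = {
    "naturalness": "Is the question fluent and natural?",
    "answerability": "Can the question be answered from the context alone?",
    "complexity": "Does answering the question require non-trivial reasoning?",
}

def score_question(question, context, ask_llm):
    """Average the judge's 1-5 ratings over all dimensions."""
    scores = {}
    for dim, criterion in DIMENSIONS.items():
        prompt = (
            f"Context: {context}\n"
            f"Question: {question}\n"
            f"{criterion} Rate from 1 to 5. Reply with the number only."
        )
        reply = ask_llm(prompt)
        match = re.search(r"[1-5]", reply)
        # Fall back to the worst score if the judge's reply is unparsable.
        scores[dim] = int(match.group()) if match else 1
    return mean(scores.values()), scores
```

Because each dimension is judged directly against the context, a valid question phrased very differently from any single reference is not penalized, which is the failure mode of n-gram and embedding overlap with one reference.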