Non-Factoid (NF) Question Answering (QA) is challenging to evaluate due to the diversity of potential answers and the lack of an objective evaluation criterion. Commonly used automatic metrics such as ROUGE or BERTScore cannot accurately capture semantic similarity or fairly assess answers written from different perspectives. Recently, Large Language Models (LLMs) have been adopted for NFQA evaluation owing to their compelling performance on a wide range of NLP tasks. Common approaches include pointwise scoring of each candidate answer and pairwise comparison between answers. Inspired by the evolution from pointwise to pairwise to listwise methods in learning-to-rank, we propose a novel listwise NFQA evaluation approach that uses an LLM to rank a candidate answer within a list of reference answers sorted by descending quality. Moreover, for NF questions that lack multi-grade (or any) gold answers, we leverage LLMs to generate a reference answer list of varying quality to enable listwise evaluation. Extensive experiments on three NFQA datasets, i.e., ANTIQUE, TREC-DL-NF, and WebGLM, show that our method correlates significantly better with human annotations than automatic metrics and common pointwise and pairwise approaches.
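To make the listwise idea concrete, the sketch below ranks a candidate answer within a quality-sorted reference list via a single LLM call. It is a minimal illustration, not the paper's implementation: the prompt wording, the caller-supplied `call_llm` helper, and the linear rank-to-score mapping are all assumptions introduced for exposition.

```python
from typing import Callable

def listwise_evaluate(
    question: str,
    references: list[str],   # reference answers sorted best-to-worst
    candidate: str,
    call_llm: Callable[[str], str],  # hypothetical LLM client, supplied by the caller
) -> float:
    """Ask an LLM where `candidate` ranks among quality-sorted references,
    then map the predicted insertion position to a score in [0, 1]."""
    ref_block = "\n".join(f"Rank {i + 1}: {ans}" for i, ans in enumerate(references))
    prompt = (
        "Below is a question and a list of reference answers sorted from "
        "best (Rank 1) to worst.\n\n"
        f"Question: {question}\n\n"
        f"Reference answers:\n{ref_block}\n\n"
        f"Candidate answer: {candidate}\n\n"
        f"At which position (1 = better than every reference, "
        f"{len(references) + 1} = worse than every reference) should the "
        "candidate be inserted? Reply with a single integer."
    )
    # Parse the LLM's reply and clamp it to a valid insertion position.
    rank = int(call_llm(prompt).strip())
    rank = min(max(rank, 1), len(references) + 1)
    # Linear mapping (an assumption): position 1 -> 1.0, position n+1 -> 0.0.
    return 1.0 - (rank - 1) / len(references)
```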