Long-form question answering (LFQA) enables answering a wide range of questions, but its flexibility poses enormous challenges for evaluation. We perform the first targeted study of the evaluation of long-form answers, covering both human and automatic evaluation practices. We hire domain experts in seven areas to provide preference judgments over pairs of answers, along with free-form justifications for their choices. We present a careful analysis of experts' evaluation, which focuses on new aspects such as the comprehensiveness of the answer. Next, we examine automatic text generation metrics, finding that no existing metrics are predictive of human preference judgments. However, some metrics correlate with fine-grained aspects of answers (e.g., coherence). We encourage future work to move away from a single "overall score" of the answer and adopt a multi-faceted evaluation, targeting aspects such as factuality and completeness. We publicly release all of our annotations and code to spur future work into LFQA evaluation.