Long-form question answering (LFQA) aims to provide thorough and in-depth answers to complex questions, enhancing comprehension. However, such detailed responses are prone to hallucinations and factual inconsistencies, which makes their faithful evaluation challenging. This work introduces HaluQuestQA, the first hallucination dataset with localized error annotations for human-written and model-generated LFQA answers. HaluQuestQA comprises 698 QA pairs with 1.8k span-level error annotations, covering five error types, produced by expert annotators along with preference judgments. Using the collected data, we thoroughly analyze the shortcomings of long-form answers and find that they lack comprehensiveness and provide unhelpful references. On this dataset, we train an automatic feedback model that predicts error spans containing incomplete information and provides associated explanations. Finally, we propose a prompt-based approach, Error-informed refinement, which uses signals from the learned feedback model to refine generated answers; we show that it reduces errors and improves answer quality across multiple models. Furthermore, humans find answers generated by our approach comprehensive and strongly prefer them (84%) over the baseline answers.
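As a rough illustration of the error-informed refinement loop described above, here is a minimal Python sketch of one refinement pass: the feedback model flags error spans with explanations, and the generator is prompted to revise the answer accordingly. The interfaces `feedback_model.predict` and `generator.complete`, and the `ErrorSpan` structure, are hypothetical placeholders for this sketch, not the paper's actual API.

```python
# Minimal sketch of error-informed refinement (one pass).
# `feedback_model.predict` and `generator.complete` are assumed
# placeholder interfaces, not the paper's actual implementation.

from dataclasses import dataclass


@dataclass
class ErrorSpan:
    text: str          # flagged span from the draft answer
    error_type: str    # e.g., "incomplete information"
    explanation: str   # feedback model's rationale for the flag


def refine_answer(question: str, answer: str, feedback_model, generator) -> str:
    """Refine a draft answer using span-level feedback signals."""
    spans: list[ErrorSpan] = feedback_model.predict(question, answer)
    if not spans:
        return answer  # nothing flagged; keep the draft as-is

    # Build an error-informed prompt: list each flagged span with its
    # error type and explanation, then ask the generator to revise.
    feedback_lines = "\n".join(
        f'- Span: "{s.text}" | Error: {s.error_type} | Why: {s.explanation}'
        for s in spans
    )
    prompt = (
        f"Question: {question}\n"
        f"Draft answer: {answer}\n"
        f"The following spans were flagged as erroneous:\n{feedback_lines}\n"
        "Rewrite the answer to fix these errors while preserving the "
        "correct content."
    )
    return generator.complete(prompt)
```

In this sketch, refinement is a single feedback-then-rewrite step; whether to iterate until no spans are flagged is a design choice left open here.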