Long-form question answering (LFQA) aims to provide thorough, in-depth answers to complex questions, enhancing comprehension. However, such detailed responses are prone to hallucinations and factual inconsistencies, making their faithful evaluation challenging. This work introduces HaluQuestQA, the first hallucination dataset with localized error annotations for both human-written and model-generated LFQA answers. HaluQuestQA comprises 698 QA pairs with 1.8k span-level error annotations across five error types, collected from expert annotators along with preference judgments. Using this data, we thoroughly analyze the shortcomings of long-form answers and find that they lack comprehensiveness and provide unhelpful references. We train an automatic feedback model on this dataset that predicts error spans containing incomplete information and provides associated explanations. Finally, we propose a prompt-based approach, Error-informed refinement, that uses signals from the learned feedback model to refine generated answers; we show that it reduces errors and improves answer quality across multiple models. Furthermore, human evaluators find the answers generated by our approach comprehensive and strongly prefer them (84%) over baseline answers.