Attribution in question answering provides citations that support generated statements, and has attracted wide research attention. Current methods for automatically evaluating attribution, which are often based on Large Language Models (LLMs), remain inadequate, particularly in recognizing subtle differences between attributions and complex relationships between citations and statements. To compare these attribution evaluation methods and develop new ones, we introduce a set of fine-grained categories (i.e., supportive, insufficient, contradictory, and irrelevant) for measuring attribution, and develop a Complex Attributed Question Answering (CAQA) benchmark that leverages knowledge graphs (KGs) to automatically generate attributions of different categories for question-answer pairs. Our analysis reveals that existing evaluators perform poorly under fine-grained attribution settings and exhibit weaknesses in complex citation-statement reasoning. Our CAQA benchmark, validated with human annotations, emerges as a promising tool for selecting and developing LLM attribution evaluators.