Reducing the `$\textit{hallucination}$' problem of Large Language Models (LLMs) is crucial for their wide application. A comprehensive, fine-grained measurement of hallucination is the first key step toward governing this issue, but it remains under-explored in the community. Thus, we present $\textbf{ANAH}$, a bilingual dataset that offers $\textbf{AN}$alytical $\textbf{A}$nnotation of $\textbf{H}$allucinations in LLMs within Generative Question Answering. Each answer sentence in our dataset undergoes rigorous annotation, involving the retrieval of a reference fragment, the judgment of the hallucination type, and the correction of hallucinated content. ANAH consists of ~12k sentence-level annotations for ~4.3k LLM responses covering over 700 topics, constructed by a human-in-the-loop pipeline. Thanks to the fine granularity of the hallucination annotations, we can quantitatively confirm that the hallucinations of LLMs progressively accumulate in the answer, and we use ANAH to train and evaluate hallucination annotators. We conduct extensive experiments on generative and discriminative annotators and show that, although current open-source LLMs have difficulty with fine-grained hallucination annotation, the generative annotator trained with ANAH can surpass all open-source LLMs and GPT-3.5, obtain performance competitive with GPT-4, and exhibit better generalization ability on unseen questions.