Retrieval-augmented generation (RAG) has become a primary technique for alleviating hallucinations in large language models (LLMs). Even with RAG, LLMs may still produce claims that are unsupported by, or that contradict, the retrieved content. To develop effective hallucination prevention strategies under RAG, it is important to create benchmark datasets that can measure the extent of hallucination. This paper presents RAGTruth, a corpus tailored for analyzing word-level hallucinations across various domains and tasks within standard RAG frameworks for LLM applications. RAGTruth comprises nearly 18,000 naturally generated responses from diverse LLMs using RAG. These responses have undergone meticulous manual annotation at both the individual case level and the word level, including evaluations of hallucination intensity. We not only benchmark hallucination frequencies across different LLMs, but also critically assess the effectiveness of several existing hallucination detection methods. Furthermore, we show that, using a high-quality dataset such as RAGTruth, it is possible to finetune a relatively small LLM and achieve performance in hallucination detection that is competitive with existing prompt-based approaches built on state-of-the-art large language models such as GPT-4.