Retrieval-augmented generation (RAG) has become a primary technique for alleviating hallucinations in large language models (LLMs). Despite the integration of RAG, LLMs may still present claims that are unsupported by, or contradictory to, the retrieved contents. To develop effective hallucination prevention strategies under RAG, it is important to create benchmark datasets that can measure the extent of hallucination. This paper presents RAGTruth, a corpus tailored for analyzing word-level hallucinations across various domains and tasks within standard RAG frameworks for LLM applications. RAGTruth comprises nearly 18,000 naturally generated responses from diverse LLMs using RAG. These responses have undergone meticulous manual annotation at both the individual-case and word levels, incorporating evaluations of hallucination intensity. We not only benchmark hallucination frequencies across different LLMs, but also critically assess the effectiveness of several existing hallucination detection methodologies. Furthermore, we show that using a high-quality dataset such as RAGTruth, it is possible to finetune a relatively small LLM and achieve a competitive level of performance in hallucination detection compared with existing prompt-based approaches using state-of-the-art large language models such as GPT-4.
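To make the notion of word-level annotation concrete, the sketch below shows a minimal, hypothetical record for a span-annotated response. The field names, label strings, and helper functions are illustrative assumptions for exposition, not the actual RAGTruth schema.

```python
# Hypothetical sketch of a word-level hallucination annotation record,
# loosely modeled on the setup described in the abstract.
# Field names and label values are assumptions, not the corpus's schema.
from dataclasses import dataclass

@dataclass
class Span:
    start: int   # character offset into the response (inclusive)
    end: int     # character offset (exclusive)
    label: str   # e.g. "conflict" or "baseless_info" (assumed labels)

def hallucinated_text(response: str, spans: list[Span]) -> list[str]:
    """Return the annotated hallucinated substrings of a response."""
    return [response[s.start:s.end] for s in spans]

def is_hallucinated(spans: list[Span]) -> bool:
    """Response-level flag: any annotated span makes the response positive."""
    return len(spans) > 0

# Example: "2021" is annotated as conflicting with the retrieved context.
response = "The report was published in 2021 and covers four regions."
spans = [Span(start=28, end=32, label="conflict")]
```

Annotating at the span level rather than only flagging whole responses is what enables both the response-level benchmarking and the finer-grained detection evaluation described above.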