Retrieval-Augmented Generation (RAG) techniques aim to mitigate hallucinations in Large Language Models (LLMs). However, LLMs can still produce information that is unsupported by, or contradictory to, the retrieved contexts. We introduce LYNX, a state-of-the-art (SOTA) hallucination detection LLM capable of advanced reasoning on challenging real-world hallucination scenarios. To evaluate LYNX, we present HaluBench, a comprehensive hallucination evaluation benchmark consisting of 15k samples sourced from various real-world domains. Our experimental results show that LYNX outperforms GPT-4o, Claude-3-Sonnet, and other closed- and open-source LLM-as-a-judge models on HaluBench. We release LYNX, HaluBench, and our evaluation code for public access.
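To make the judge setup concrete, the sketch below shows how a hallucination-detection model of this kind can be queried on a (question, context, answer) triple to obtain a faithfulness verdict. The checkpoint name, prompt wording, and PASS/FAIL output format are illustrative assumptions, not confirmed specifics of the paper; the released LYNX weights and exact prompt template should be taken from the public release.

```python
# Minimal sketch of LLM-as-a-judge hallucination detection, assuming a
# LYNX-style judge model released on Hugging Face. The model name and
# prompt template below are hypothetical placeholders for illustration.
from transformers import pipeline

# Hypothetical checkpoint name; substitute the actual released LYNX weights.
judge = pipeline(
    "text-generation",
    model="PatronusAI/Llama-3-Patronus-Lynx-8B-Instruct",
)

question = "What is the capital of France?"
context = "Paris is the capital and most populous city of France."
answer = "The capital of France is Lyon."  # contradicts the context

# Illustrative judge prompt: ask whether the answer is faithful to the
# retrieved context, requesting a verdict followed by reasoning.
prompt = (
    "Given the following QUESTION, CONTEXT, and ANSWER, determine whether "
    "the ANSWER is faithful to the CONTEXT. Reply with PASS if the answer "
    "is fully supported by the context and FAIL otherwise, then explain "
    "your reasoning.\n"
    f"QUESTION: {question}\n"
    f"CONTEXT: {context}\n"
    f"ANSWER: {answer}\n"
    "VERDICT:"
)

# Greedy decoding keeps the verdict deterministic for a fixed input.
result = judge(prompt, max_new_tokens=128, do_sample=False)
print(result[0]["generated_text"])
```

On an example like this, a reliable judge should output FAIL, since the answer contradicts the retrieved context; HaluBench pairs such triples with faithfulness labels so that judge models can be scored on exactly this decision.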