Large Language Models (LLMs) have shown impressive capabilities but also a concerning tendency to hallucinate. This paper presents RefChecker, a framework that introduces claim-triplets to represent claims in LLM responses, aiming to detect fine-grained hallucinations. In RefChecker, an extractor generates claim-triplets from a response, which are then evaluated by a checker against a reference. We delineate three task settings: Zero, Noisy, and Accurate Context, to reflect various real-world use cases. We curated a benchmark spanning various NLP tasks and annotated 11k claim-triplets from 2.1k responses by seven LLMs. RefChecker supports both proprietary and open-source models as the extractor and checker. Experiments demonstrate that claim-triplets enable superior hallucination detection compared to other granularities such as response-, sentence-, and sub-sentence-level claims. RefChecker outperforms prior methods by 6.8 to 26.1 points on our benchmark, and its checking results are strongly aligned with human judgments. This work is open-sourced at https://github.com/amazon-science/RefChecker
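The extract-then-check pipeline described above can be illustrated with a minimal sketch. The `ClaimTriplet` structure and the `Entailment`/`Neutral` labels follow the paper's framing; the substring-matching checker below is a toy stand-in for illustration only — in RefChecker the extractor and checker are LLMs, not rule-based heuristics.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ClaimTriplet:
    """A (subject, predicate, object) claim extracted from an LLM response."""
    subject: str
    predicate: str
    obj: str


def check_triplet(triplet: ClaimTriplet, reference: str) -> str:
    """Toy heuristic checker: labels a triplet 'Entailment' only if all
    three elements literally appear in the reference text, else 'Neutral'.
    A real checker would use an LLM to judge entailment against the
    reference."""
    ref = reference.lower()
    parts = (triplet.subject, triplet.predicate, triplet.obj)
    if all(p.lower() in ref for p in parts):
        return "Entailment"
    return "Neutral"


def check_response(triplets: List[ClaimTriplet],
                   reference: str) -> Dict[ClaimTriplet, str]:
    """Check each extracted triplet; any non-entailed triplet flags a
    potential fine-grained hallucination in the response."""
    return {t: check_triplet(t, reference) for t in triplets}


if __name__ == "__main__":
    reference = "Paris is the capital of France."
    supported = ClaimTriplet("Paris", "capital of", "France")
    hallucinated = ClaimTriplet("Paris", "capital of", "Germany")
    labels = check_response([supported, hallucinated], reference)
    print(labels[supported])      # Entailment
    print(labels[hallucinated])   # Neutral
```

Checking at the triplet level is what enables the fine-grained localization the paper reports: a single response yields many triplets, and each one receives its own verdict rather than a single response-level label.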