Large language models (LLMs) are prone to hallucinations: generating unreliable outputs that are unfaithful to their inputs, contradict external facts, or are internally inconsistent. In this work, we address several challenges for post-hoc hallucination detection in production settings. Our detection pipeline entails three steps: first, producing a confidence score representing the likelihood that a generated answer is a hallucination; second, calibrating the score conditional on attributes of the input and candidate response; finally, performing detection by thresholding the calibrated score. We benchmark a variety of state-of-the-art scoring methods on datasets spanning question answering, fact checking, and summarization tasks, and employ diverse LLMs to ensure a comprehensive assessment of performance. We show that calibrating individual scoring methods is critical for risk-aware downstream decision making. Based on the finding that no individual score performs best in all situations, we propose a multi-scoring framework that combines different scores and achieves top performance across all datasets. We further introduce cost-effective multi-scoring, which can match or even outperform more expensive detection methods while significantly reducing computational overhead.
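To make the pipeline concrete, here is a minimal sketch of the score, calibrate, and threshold flow. Isotonic regression is one common calibration choice, not necessarily the one used in the paper, and all variable names (raw_scores, is_hallucination, DECISION_THRESHOLD) are illustrative assumptions.

```python
# Sketch of the score -> calibrate -> threshold pipeline described above.
# Assumes isotonic regression as the calibrator; the paper's actual
# calibration method and threshold selection may differ.
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Step 1: raw confidence scores from some detector on a labeled dev set,
# where label 1 means the response was annotated as a hallucination.
raw_scores = np.array([0.10, 0.35, 0.40, 0.62, 0.70, 0.88, 0.91, 0.97])
is_hallucination = np.array([0, 0, 1, 0, 1, 1, 1, 1])

# Step 2: calibrate so the score approximates P(hallucination | score).
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(raw_scores, is_hallucination)

# Step 3: detect by thresholding the calibrated score.
DECISION_THRESHOLD = 0.5  # chosen to meet a downstream risk tolerance
new_scores = np.array([0.30, 0.85])
calibrated = calibrator.predict(new_scores)
flags = calibrated >= DECISION_THRESHOLD
print(list(zip(calibrated.round(2), flags)))
```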
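Similarly, a minimal sketch of how a multi-scoring combiner might work, assuming three hypothetical per-response scores (an NLI-based score, a self-consistency score, and a verbalized confidence) stacked as features for a simple logistic-regression meta-classifier; the paper's actual feature set and combination method are not reproduced here.

```python
# Sketch of a multi-scoring combiner: several detector scores per response
# are combined into a single hallucination probability. Feature names and
# the logistic-regression combiner are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Rows: responses; columns: nli_score, consistency_score, verbalized_conf.
X_train = np.array([
    [0.2, 0.1, 0.3],
    [0.8, 0.7, 0.9],
    [0.4, 0.9, 0.2],
    [0.9, 0.8, 0.7],
])
y_train = np.array([0, 1, 0, 1])  # 1 = hallucination

combiner = LogisticRegression()
combiner.fit(X_train, y_train)

# The combined probability can then be calibrated and thresholded
# exactly as in the single-score pipeline above.
X_new = np.array([[0.7, 0.6, 0.8]])
print(combiner.predict_proba(X_new)[:, 1])
```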