Large language models (LLMs) are prone to hallucinations: generating unreliable outputs that are unfaithful to their inputs or to external facts, or that are internally inconsistent. In this work, we address several challenges for post-hoc hallucination detection in production settings. Our detection pipeline entails three steps: first, producing a confidence score representing the likelihood that a generated answer is a hallucination; second, calibrating the score conditional on attributes of the input and candidate response; finally, performing detection by thresholding the calibrated score. We benchmark a variety of state-of-the-art scoring methods on datasets spanning question answering, fact checking, and summarization tasks, and we employ diverse LLMs to ensure a comprehensive assessment of performance. We show that calibrating individual scoring methods is critical for risk-aware downstream decision making. Motivated by the finding that no individual score performs best in all situations, we propose a multi-scoring framework that combines different scores and achieves top performance across all datasets. We further introduce cost-effective multi-scoring, which matches or even outperforms more expensive detection methods while significantly reducing computational overhead.
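To make the score-calibrate-threshold pipeline concrete, here is a minimal sketch in Python, assuming access to held-out labeled examples and scikit-learn. The attribute features and the Platt-style (logistic) calibrator are illustrative assumptions, not the paper's exact method.

```python
# Illustrative sketch of the score -> calibrate -> threshold pipeline.
# The scoring function, attribute features, and threshold are hypothetical
# stand-ins rather than the paper's specific implementation.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_calibrator(raw_scores, attributes, labels):
    """Fit a simple attribute-conditional calibrator (Platt-style scaling).

    raw_scores: (n,) uncalibrated hallucination scores from any detector.
    attributes: (n, d) features of the input/response (e.g., length, task type).
    labels:     (n,) 1 if the response was annotated as a hallucination.
    """
    features = np.column_stack([raw_scores, attributes])
    calibrator = LogisticRegression()
    calibrator.fit(features, labels)
    return calibrator

def detect(calibrator, raw_score, attribute_vec, threshold=0.5):
    """Flag a response as a hallucination when its calibrated probability
    exceeds a risk threshold chosen for the downstream application."""
    features = np.concatenate([[raw_score], attribute_vec]).reshape(1, -1)
    prob = calibrator.predict_proba(features)[0, 1]
    return prob, prob >= threshold
```

Conditioning the calibrator on attributes is what lets the same raw score translate into different risk estimates across tasks or response types.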
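The multi-scoring framework can be sketched in the same spirit: individual detector scores are stacked into a feature vector and a lightweight combiner is trained on labeled data. The detector list and the logistic-regression combiner below are assumptions for illustration; a cost-effective variant would simply restrict `score_fns` to the cheaper detectors.

```python
# Hypothetical sketch of multi-scoring: several scoring methods are run on the
# same (question, answer) pair and their outputs are combined by a learned
# model. Detector choices and the combiner are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def multi_score(score_fns, question, answer):
    """Run several scoring methods (e.g., self-consistency, NLI entailment,
    token-level uncertainty) and return their scores as one feature vector."""
    return np.array([fn(question, answer) for fn in score_fns])

def fit_combiner(score_matrix, labels):
    """Learn weights over the individual scores from labeled examples;
    restricting score_matrix to cheap detectors gives the cost-effective
    variant at reduced compute."""
    combiner = LogisticRegression()
    combiner.fit(score_matrix, labels)
    return combiner
```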