Large Language Models (LLMs) have emerged as powerful foundation models for a wide variety of tasks, but they are also prone to hallucinations, i.e., responses that sound confident but are incorrect or even nonsensical. Existing hallucination detectors propose a wide range of empirical scoring rules, but their performance varies across models and datasets, making it hard to determine which of them can be relied on in practice. In this work, we formulate hallucination detection as a hypothesis testing problem and draw parallels with out-of-distribution detection in machine learning models. We then propose a multiple-testing-inspired method that systematically aggregates multiple evaluation scores via conformal p-values, enabling calibrated detection with a controlled false alarm rate. Extensive experiments across diverse models and datasets validate the robustness of our approach in comparison with state-of-the-art methods.
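To make the idea of aggregating several detector scores through conformal p-values concrete, the following is a minimal sketch and not the method proposed in this work. It assumes a calibration set of responses known to be non-hallucinated (the null hypothesis), assumes that a higher score means a more suspicious response, and uses a Bonferroni combination as a stand-in for the aggregation rule, which the abstract does not specify. The function names conformal_p_values and bonferroni_detect are hypothetical.

```python
import numpy as np


def conformal_p_values(cal_scores, test_scores):
    """Per-detector conformal p-values.

    cal_scores:  (n, K) scores of K detectors on n calibration responses
                 assumed to be non-hallucinated (null samples).
    test_scores: (m, K) scores of the same K detectors on m test responses.
    Assumption: a higher score means "more likely hallucinated".
    """
    cal = np.asarray(cal_scores, dtype=float)
    test = np.asarray(test_scores, dtype=float)
    n = cal.shape[0]
    # For each test response and detector, count the calibration scores
    # that are at least as extreme as the test score.
    counts = (cal[None, :, :] >= test[:, None, :]).sum(axis=1)  # shape (m, K)
    # The +1 terms give the standard finite-sample conformal p-value.
    return (1.0 + counts) / (n + 1.0)


def bonferroni_detect(p_values, alpha=0.1):
    """Combine K p-values per response with a Bonferroni correction.

    This is an illustrative stand-in for the aggregation step; it remains
    valid under arbitrary dependence between the detectors.
    Returns the combined p-values and a boolean hallucination flag.
    """
    p = np.asarray(p_values, dtype=float)
    k = p.shape[1]
    combined = np.minimum(1.0, k * p.min(axis=1))
    return combined, combined <= alpha
```

Under the assumption that calibration and non-hallucinated test responses are exchangeable, flagging a response when its combined p-value is at most alpha keeps the false alarm rate at or below alpha. Bonferroni is conservative; other combination rules trade this conservativeness for additional assumptions on how the scores depend on each other.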