Hallucinations in Large Language Model (LLM) outputs for Question Answering (QA) tasks can critically undermine their real-world reliability. This paper introduces a methodology for robust, one-shot hallucination detection, specifically designed for scenarios with limited data access, such as interacting with black-box LLM APIs that typically expose only a few top candidate log-probabilities per token. Our approach derives uncertainty indicators directly from these readily available log-probabilities generated during non-greedy decoding. We first derive an Entropy Production Rate (EPR) that offers baseline performance, which we later augment with supervised learning. Our learned model leverages the entropic contributions of the accessible top-ranked tokens within a single generated sequence, without multiple re-runs per query. Evaluated across diverse QA datasets and multiple LLMs, this estimator significantly improves token-level hallucination detection over state-of-the-art methods. Crucially, high performance is demonstrated using only the typically small set of available log-probabilities (e.g., top-10 per token), confirming its practical efficiency and suitability for API-constrained deployments. This work provides a lightweight technique to enhance the trustworthiness of LLM responses, at the token level, after a single generation pass, for QA and Retrieval-Augmented Generation (RAG) systems. Our experiments confirm the performance of our method against existing approaches on public datasets as well as within a financial framework analyzing annual company reports.
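The core idea above can be sketched as follows: compute a Shannon entropy at each decoding step from the few top-k log-probabilities an API exposes, then aggregate these per-token entropies into a sequence-level rate. This is a minimal illustration, not the paper's exact EPR definition; the renormalization over the truncated candidate set and the simple per-token average are assumptions made for the sketch.

```python
import math

def token_entropy(top_logprobs):
    """Entropy of one decoding step from its top-k candidate log-probabilities
    (e.g., the top-10 values returned by a black-box LLM API)."""
    probs = [math.exp(lp) for lp in top_logprobs]
    # Renormalize over the truncated candidate set, since the full
    # vocabulary distribution is not exposed by the API (an assumption
    # of this sketch, not necessarily the paper's treatment).
    total = sum(probs)
    probs = [p / total for p in probs]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_production_rate(sequence_logprobs):
    """Average per-token entropy over a single generated sequence.
    sequence_logprobs: one list of top-k log-probabilities per token."""
    entropies = [token_entropy(step) for step in sequence_logprobs]
    return sum(entropies) / len(entropies)
```

A confidently generated sequence (probability mass concentrated on one candidate per step) yields a low EPR, while a flat candidate distribution yields an EPR near log(k), which is the intuition behind flagging high-entropy tokens as potential hallucinations.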