Uncertainty quantification (UQ) has emerged as an effective approach to closed-book hallucination detection for LLMs, but existing methods are largely designed for short-form outputs and do not generalize well to long-form generation. We introduce a taxonomy for fine-grained uncertainty quantification in long-form LLM outputs that distinguishes methods by their design choices at three stages: response decomposition, unit-level scoring, and response-level aggregation. We formalize several families of consistency-based black-box scorers, providing generalizations and extensions of existing methods. In our experiments across multiple LLMs and datasets, we find that 1) claim-response entailment consistently performs better than, or on par with, more complex claim-level scorers, 2) claim-level scoring generally yields better results than sentence-level scoring, and 3) uncertainty-aware decoding is highly effective at improving the factuality of long-form outputs. Our framework clarifies the relationships between prior methods, enables apples-to-apples comparisons, and provides practical guidance for selecting components for fine-grained UQ.
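To make the three-stage structure concrete, the following is a minimal sketch of one instantiation: claims as units, a claim-response entailment scorer computed against resampled responses, and mean aggregation. It assumes a hypothetical NLI-style `entails` callable and a pre-decomposed claim list; it is an illustration of the general pipeline shape, not the paper's exact method.

```python
# Sketch of the three-stage pipeline: response decomposition (claims assumed
# already extracted), unit-level scoring, and response-level aggregation.
# `entails` is a hypothetical black-box NLI scorer returning P(premise => claim).

from typing import Callable, List

def claim_uncertainty(
    claim: str,
    sampled_responses: List[str],
    entails: Callable[[str, str], float],
) -> float:
    """Unit-level score: 1 minus mean entailment of the claim by resampled responses."""
    support = sum(entails(resp, claim) for resp in sampled_responses)
    return 1.0 - support / len(sampled_responses)

def response_uncertainty(
    response_claims: List[str],
    sampled_responses: List[str],
    entails: Callable[[str, str], float],
) -> float:
    """Response-level aggregation: here, a simple mean over claim-level scores."""
    scores = [claim_uncertainty(c, sampled_responses, entails)
              for c in response_claims]
    return sum(scores) / len(scores) if scores else 0.0
```

Other members of the taxonomy swap out the unit (e.g., sentences instead of claims), the unit-level scorer, or the aggregation function while keeping this overall structure.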