Claim-level Uncertainty Quantification (UQ) is a promising approach to mitigating the limited reliability of Large Language Models (LLMs). We introduce MUCH, the first claim-level UQ benchmark designed for fair and reproducible evaluation of future methods under realistic conditions. It comprises 4,873 samples across four European languages (English, French, Spanish, and German) and four instruction-tuned open-weight LLMs. Unlike prior claim-level benchmarks, we release 24 generation logits per token, facilitating the development of future white-box methods without re-generating the data. Moreover, in contrast to previous benchmarks that rely on manual or LLM-based segmentation, we propose a new deterministic algorithm that segments claims using as little as 0.2% of the LLM generation time. This makes our segmentation approach suitable for real-time monitoring of LLM outputs and ensures that MUCH evaluates UQ methods under realistic deployment constraints. Finally, our evaluations show that current methods still have substantial room for improvement in both performance and efficiency.