To enhance the reliability of Large Language Models (LLMs), calibration is essential: the model's assessed confidence scores should align with the actual likelihood of its responses being correct. However, current confidence elicitation methods and calibration metrics typically rely on a binary true/false assessment of response correctness. This approach does not apply to long-form generation, where an answer can be partially correct. To address this gap, we introduce a unified calibration framework in which both the correctness of an LLM's responses and its associated confidence levels are treated as distributions over a range of scores. Within this framework, we develop three metrics to precisely evaluate LLM calibration, and we further propose two confidence elicitation methods based on self-consistency and self-evaluation. Our experiments, which include long-form QA and summarization tasks, demonstrate that larger models do not necessarily yield better calibration, that calibration performance is metric-dependent, and that self-consistency methods excel on factoid datasets. We also find that calibration can be enhanced through techniques such as fine-tuning, integrating relevant source documents, scaling the temperature, and combining self-consistency with self-evaluation. Lastly, we showcase a practical application of our framework: selecting and cascading open-source models and ChatGPT to optimize correctness under a limited API budget. This research not only challenges existing notions of LLM calibration but also offers practical methodologies for improving trustworthiness in long-form generation.
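The two core ideas above can be illustrated with a minimal sketch. The first function generalizes an ECE-style calibration gap to the setting where both confidence and correctness are graded scores in [0, 1] rather than binary labels; the second scores confidence by agreement across multiple sampled responses, in the spirit of self-consistency. Both are hypothetical simplifications for illustration, not the paper's actual metrics or elicitation procedure, and the `similarity` function is an assumed user-supplied comparator.

```python
import numpy as np

def continuous_ece(confidences, correctness, n_bins=10):
    """ECE-style calibration gap where both confidence and correctness are
    graded scores in [0, 1], not binary labels (illustrative sketch)."""
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correctness, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        if i < n_bins - 1:
            mask = (conf >= lo) & (conf < hi)
        else:
            mask = (conf >= lo) & (conf <= hi)  # include 1.0 in the last bin
        if mask.any():
            # weight each bin's |avg confidence - avg correctness| by its size
            ece += mask.mean() * abs(conf[mask].mean() - corr[mask].mean())
    return ece

def self_consistency_confidence(samples, similarity):
    """Confidence of the first sampled response = its mean similarity to the
    other samples (hypothetical agreement-based scoring)."""
    head, rest = samples[0], samples[1:]
    return sum(similarity(head, s) for s in rest) / len(rest)
```

For example, with perfectly aligned scores `continuous_ece([0.1, 0.5, 0.9], [0.1, 0.5, 0.9])` is 0, while a uniformly overconfident model (confidence 0.9 everywhere, average correctness 0.5) yields a gap of 0.4.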