Recent advances in the capabilities of large language models (LLMs) have enabled a wide range of applications across many fields. A significant challenge remains, however: these models often "hallucinate", i.e., fabricate facts, and give users no apparent means of judging the veracity of their statements. Uncertainty estimation (UE) methods are one path to safer, more responsible, and more effective use of LLMs. To date, though, research on UE methods for LLMs has focused primarily on theoretical rather than engineering contributions. In this work, we address this gap by introducing LM-Polygraph, a framework that implements a battery of state-of-the-art UE methods for LLMs in text generation tasks behind unified program interfaces in Python. It also introduces an extendable benchmark that lets researchers evaluate UE techniques consistently, and a demo web application that enriches the standard chat dialog with confidence scores, helping end-users discern unreliable responses. LM-Polygraph is compatible with the most recent LLMs, including BLOOMz, LLaMA-2, ChatGPT, and GPT-4, and is designed to support future releases of similarly-styled LMs.
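To make the notion of an uncertainty score concrete, the following is a minimal sketch of two simple information-based scores computed from a model's token probabilities: the negative log-probability of the generated sequence and the mean per-token entropy. It is an illustration only and does not use or reproduce LM-Polygraph's interfaces; the function names and the toy inputs are assumptions made for this example.

```python
import math
from typing import Sequence


def sequence_neg_log_prob(token_log_probs: Sequence[float]) -> float:
    """Negative log-probability of the whole generated sequence.

    `token_log_probs` holds the log-probability the model assigned to each
    token it actually generated. Higher return values mean higher uncertainty.
    """
    return -sum(token_log_probs)


def mean_token_entropy(token_distributions: Sequence[Sequence[float]]) -> float:
    """Average Shannon entropy (in nats) of the per-step next-token distributions.

    Each inner sequence is a probability distribution over the vocabulary at
    one generation step. Zero-probability entries contribute no entropy.
    """
    if not token_distributions:
        raise ValueError("need at least one generation step")
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0.0)
        for dist in token_distributions
    ]
    return sum(entropies) / len(entropies)


# Toy two-step generation over a two-token vocabulary (hypothetical numbers):
# step 1 is a coin flip, step 2 is fully confident.
toy_distributions = [[0.5, 0.5], [1.0, 0.0]]
toy_log_probs = [math.log(0.5), math.log(1.0)]

nll_score = sequence_neg_log_prob(toy_log_probs)
entropy_score = mean_token_entropy(toy_distributions)
```

In a chat-style interface, a score of this kind could be mapped to a displayed confidence value; how that mapping is done is a separate design choice that this sketch leaves open.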