The surge in applications of large language models (LLMs) has prompted concerns about the generation of misleading or fabricated information, known as hallucinations. Detecting hallucinations has therefore become critical to maintaining trust in LLM-generated content. A primary challenge in learning a truthfulness classifier is the lack of large quantities of labeled truthful and hallucinated data. To address this challenge, we introduce HaloScope, a novel learning framework that leverages unlabeled LLM generations in the wild for hallucination detection. Such unlabeled data arises freely upon deploying LLMs in the open world, and consists of both truthful and hallucinated information. To harness the unlabeled data, we present an automated membership estimation score for distinguishing between truthful and untruthful generations within the unlabeled mixture, thereby enabling the training of a binary truthfulness classifier on top. Importantly, our framework requires no extra data collection or human annotation, offering strong flexibility and practicality for real-world applications. Extensive experiments show that HaloScope achieves superior hallucination detection performance, outperforming competitive baselines by a significant margin. Code is available at https://github.com/deeplearningwisc/haloscope.
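The two-stage idea in the abstract (score unlabeled generations for membership, pseudo-label by thresholding, then fit a binary truthfulness classifier on top) can be sketched in miniature. This is not the paper's actual method: the embeddings below are synthetic stand-ins for LLM hidden states, the membership score is a simple projection onto the top singular direction of the centered embeddings, and the classifier is plain logistic regression trained by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for hidden-state embeddings of unlabeled LLM
# generations: truthful and hallucinated samples are assumed to differ
# along one latent direction (an assumption of this toy setup).
d = 16
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
truthful = rng.normal(size=(100, d)) + 2.0 * direction
hallucinated = rng.normal(size=(100, d)) - 2.0 * direction
unlabeled = np.vstack([truthful, hallucinated])  # labels unknown in training

# Membership estimation score: project each centered embedding onto the
# top singular direction, which captures the dominant variation in the
# unlabeled mixture.
centered = unlabeled - unlabeled.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
scores = centered @ vt[0]

# Pseudo-label by thresholding the score at its median, then train a
# binary truthfulness classifier (logistic regression) on top.
pseudo = (scores > np.quantile(scores, 0.5)).astype(float)
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(unlabeled @ w + b)))  # sigmoid probabilities
    grad = p - pseudo                               # logistic-loss gradient
    w -= 0.1 * (unlabeled.T @ grad) / len(pseudo)
    b -= 0.1 * grad.mean()
```

Note the sign of the top singular direction is arbitrary, so the learned classifier may assign "truthful" to either cluster; in practice a handful of labeled examples or a calibration step would fix the orientation.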