While Large Language Models (LLMs) have shown exceptional performance on a variety of tasks, one of their most prominent drawbacks is that they generate inaccurate or false information in a confident tone. In this paper, we provide evidence that an LLM's internal state can be used to reveal the truthfulness of statements, both statements provided to the LLM and statements that the LLM itself generates. Our approach is to train a classifier that outputs the probability that a statement is truthful, based on the hidden layer activations of the LLM as it reads or generates the statement. Experiments demonstrate that, given a set of test sentences of which half are true and half are false, our trained classifier achieves an average accuracy of 71\% to 83\% in labeling sentences as true or false, depending on the LLM base model. Furthermore, we explore the relationship between our classifier's performance and approaches based on the probability the LLM assigns to the sentence. We show that while LLM-assigned sentence probability is related to sentence truthfulness, this probability also depends on sentence length and on the frequencies of the words in the sentence. As a result, our trained classifier provides a more reliable approach to detecting truthfulness, highlighting its potential to enhance the reliability of LLM-generated content and its practical applicability in real-world scenarios.
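The idea of training a classifier on hidden layer activations can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the abstract: the activations are synthetic Gaussian vectors rather than real LLM hidden states, the classifier is a plain logistic-regression probe (the abstract does not state the classifier architecture), and the names `make_synthetic_activations`, `train_probe`, and `predict_proba` are hypothetical. With a real model, each vector would instead be the hidden state of a chosen layer recorded as the LLM reads or generates a statement.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic_activations(n, dim, rng):
    """Return (vectors, labels). Placeholder data, NOT real LLM activations."""
    # Assumption for illustration: true and false statements are shifted
    # in opposite directions along one random axis of activation space.
    direction = [rng.gauss(0, 1) for _ in range(dim)]
    data = []
    for i in range(n):
        label = i % 2  # balanced: half "true" (1), half "false" (0)
        sign = 1.0 if label == 1 else -1.0
        vec = [rng.gauss(0, 1) + sign * 0.5 * d for d in direction]
        data.append((vec, label))
    rng.shuffle(data)
    return [v for v, _ in data], [y for _, y in data]


def train_probe(xs, ys, epochs=200, lr=0.1):
    """Fit logistic regression with full-batch gradient descent."""
    dim = len(xs[0])
    w = [0.0] * dim
    b = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log loss w.r.t. the logit
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Probability that the statement behind activation x is true."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def accuracy(w, b, xs, ys):
    correct = sum(
        int((predict_proba(w, b, x) >= 0.5) == bool(y)) for x, y in zip(xs, ys)
    )
    return correct / len(xs)


rng = random.Random(0)
xs, ys = make_synthetic_activations(400, 16, rng)
train_x, train_y = xs[:300], ys[:300]
test_x, test_y = xs[300:], ys[300:]

w, b = train_probe(train_x, train_y)
test_acc = accuracy(w, b, test_x, test_y)
print(f"held-out accuracy on synthetic data: {test_acc:.2f}")
```

The accuracy printed here reflects only how separable the synthetic data was made, so it says nothing about the 71\% to 83\% range reported for real LLMs. The sketch is meant to show the shape of the pipeline: one activation vector per statement goes in, and a probability of truthfulness comes out.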