While Large Language Models (LLMs) perform exceptionally well on a wide range of tasks, arguably their most prominent drawback is that they generate inaccurate or false information in a confident tone. In this paper, we hypothesize that an LLM's internal state can reveal whether a statement is true. We therefore introduce a simple yet effective method that detects the truthfulness of LLM-generated statements from the LLM's hidden-layer activations. To train and evaluate the method, we compose a dataset of true and false statements on six different topics. A classifier is trained to predict whether each statement is true or false; its input is the LLM's activation values for that statement. Our experiments demonstrate that the method significantly outperforms even few-shot prompting methods, highlighting its potential to enhance the reliability of LLM-generated content and its practical applicability in real-world scenarios.
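As a rough illustration of the idea of training a classifier on activation values, the sketch below fits a simple logistic-regression probe on synthetic vectors. Everything in it is an assumption made for illustration: the function names (`make_synthetic_activations`, `train_probe`, `accuracy`), the choice of logistic regression, the vector dimension, and above all the data, which are random vectors standing in for real hidden-layer activations. It is not the classifier or the dataset used in this work.

```python
import math
import random


def make_synthetic_activations(n, dim, seed=0):
    """Return n (vector, label) pairs that stand in for LLM activations.

    Assumption for illustration only: "true" statements (label 1) are
    shifted along a fixed direction and "false" ones (label 0) against it.
    Real activations would be read from a hidden layer of an LLM.
    """
    direction_rng = random.Random(1234)  # same direction for every split
    direction = [direction_rng.gauss(0, 1) for _ in range(dim)]
    noise_rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2
        sign = 1.0 if label == 1 else -1.0
        vector = [noise_rng.gauss(0, 1) + sign * d for d in direction]
        data.append((vector, label))
    return data


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(data, epochs=50, lr=0.1):
    """Fit a logistic-regression probe with plain stochastic gradient descent."""
    dim = len(data[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            grad = p - y  # derivative of the log loss w.r.t. the logit
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def accuracy(w, b, data):
    """Fraction of vectors whose predicted label matches the true label."""
    correct = 0
    for x, y in data:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        correct += int((p >= 0.5) == (y == 1))
    return correct / len(data)


if __name__ == "__main__":
    train = make_synthetic_activations(200, 16, seed=1)
    held_out = make_synthetic_activations(100, 16, seed=2)
    w, b = train_probe(train)
    print("held-out accuracy:", accuracy(w, b, held_out))
```

With a real model, the synthetic vectors would be replaced by the activation values recorded for each statement in the dataset, and the probe could be swapped for any classifier that accepts fixed-length vectors.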