We study how to characterize and predict the truthfulness of text generated by large language models (LLMs), a crucial step in building trust between humans and LLMs. Although several approaches based on entropy or verbalized uncertainty have been proposed to calibrate model predictions, these methods are often intractable, sensitive to hyperparameters, and less reliable when applied to generative tasks with LLMs. In this paper, we propose investigating internal activations and quantifying an LLM's truthfulness using the local intrinsic dimension (LID) of model activations. Through experiments on four question answering (QA) datasets, we demonstrate the effectiveness of our proposed method. Additionally, we study intrinsic dimensions in LLMs and their relations with model layers, autoregressive language modeling, and the training of LLMs, revealing that intrinsic dimensions can be a powerful approach to understanding LLMs.
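The abstract does not say which LID estimator is used. As a minimal sketch, assuming the standard Levina-Bickel maximum-likelihood estimator with Euclidean distances, the LID at one activation vector can be estimated from its distances to the k nearest reference activations. The names `lid_mle`, `query`, `reference`, and `k` are illustrative and are not taken from the paper; this is not the authors' implementation.

```python
import numpy as np


def lid_mle(query, reference, k=20):
    """Levina-Bickel MLE of the local intrinsic dimension at `query`.

    query:     (d,) activation vector whose neighbourhood is examined.
    reference: (n, d) array of activations used as the neighbour pool.
    k:         number of nearest neighbours (hypothetical default).
    """
    query = np.asarray(query, dtype=float)
    reference = np.asarray(reference, dtype=float)
    if k < 2:
        raise ValueError("k must be at least 2")

    # Euclidean distance from the query to every reference point.
    dists = np.linalg.norm(reference - query, axis=1)

    # Drop exact duplicates of the query (distance 0), which would
    # make the log ratios below undefined.
    dists = dists[dists > 0]
    if dists.size < k:
        raise ValueError("not enough distinct reference points for this k")

    # Distances to the k nearest neighbours, in ascending order.
    nearest = np.sort(dists)[:k]

    # MLE: (k - 1) / sum_{j < k} log(T_k / T_j), with T_k the k-th distance.
    log_sum = np.sum(np.log(nearest[-1] / nearest[:-1]))
    if log_sum == 0:
        raise ValueError("all k neighbours are equidistant; estimate undefined")
    return (k - 1) / log_sum
```

Under these assumptions, `query` would be the hidden-state vector of a generated answer at some layer and `reference` a set of activations collected from other generations. How the resulting LID value relates to truthfulness is the question the paper studies, and the sketch does not settle it.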