Large Language Models (LLMs) demonstrate an impressive capacity to recall a vast range of factual knowledge. However, understanding the reasoning and internal mechanisms by which they exploit this knowledge remains a key research area. This work unveils the factual information an LLM represents internally for sentence-level claim verification. We propose an end-to-end framework that decodes the factual knowledge embedded in token representations from a vector space into a set of ground predicates, and that shows its layer-wise evolution using a dynamic knowledge graph. To extract the encoded knowledge, our framework employs activation patching, a vector-level technique that alters a token representation during inference. Accordingly, we rely on neither training nor external models. Using factual and common-sense claims from two claim verification datasets, we showcase interpretability analyses at local and global levels. The local analysis highlights the centrality of entities in LLM reasoning, covering claim-related information, multi-hop reasoning, and representation errors that cause erroneous evaluation. The global analysis, in turn, reveals trends in the underlying evolution, such as word-based knowledge evolving into claim-related facts. By interpreting semantics from LLM latent representations and enabling graph-related analyses, this work enhances the understanding of the factual knowledge resolution process.
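To make the notion of activation patching concrete, the following is a minimal sketch on a toy model, not the framework's actual implementation. It assumes a hypothetical stack of three hand-written layers and two-dimensional token vectors; the names `make_layer`, `forward`, and the example prompts are illustrative only. The sketch runs a source prompt, extracts one token representation at a chosen layer, and overwrites a token representation in a second inference run with it.

```python
import math
from typing import Callable, List, Optional, Tuple

Vector = List[float]
Hidden = List[Vector]  # one vector per token position
Patch = Tuple[int, int, Vector]  # (layer index, token index, replacement vector)


def make_layer(scale: float) -> Callable[[Hidden], Hidden]:
    """Hypothetical toy layer: mix each token with the sequence mean, then tanh."""
    def layer(hidden: Hidden) -> Hidden:
        dim = len(hidden[0])
        mean = [sum(tok[d] for tok in hidden) / len(hidden) for d in range(dim)]
        return [
            [math.tanh(scale * tok[d] + 0.5 * mean[d]) for d in range(dim)]
            for tok in hidden
        ]
    return layer


def forward(
    layers: List[Callable[[Hidden], Hidden]],
    embeddings: Hidden,
    patch: Optional[Patch] = None,
) -> List[Hidden]:
    """Run all layers and return the hidden state entering each layer,
    followed by the final output. If `patch` is given, one token vector is
    overwritten just before the chosen layer is applied."""
    hidden = [list(tok) for tok in embeddings]
    states: List[Hidden] = []
    for index, layer in enumerate(layers):
        if patch is not None and patch[0] == index:
            _, token_index, vector = patch
            hidden[token_index] = list(vector)  # the patching step
        states.append([list(tok) for tok in hidden])
        hidden = layer(hidden)
    states.append(hidden)
    return states


# Assumed toy setup: three layers, three tokens per prompt, two dimensions.
layers = [make_layer(0.9), make_layer(1.1), make_layer(0.7)]
source_prompt = [[0.2, -0.4], [0.9, 0.1], [-0.3, 0.8]]
target_prompt = [[0.0, 0.5], [0.1, -0.2], [0.6, 0.6]]

# 1. Run the source prompt and extract the token at position 1 entering layer 1.
source_states = forward(layers, source_prompt)
extracted = source_states[1][1]

# 2. Run the target prompt without and with the patch at position 2, layer 1.
clean_states = forward(layers, target_prompt)
patched_states = forward(layers, target_prompt, patch=(1, 2, extracted))

# 3. Compare the final outputs to see what the patched vector changed.
changed = patched_states[-1] != clean_states[-1]
```

In a real LLM the same idea would be applied to the hidden states of transformer layers, and the output of the patched run would be generated text rather than a vector; how that text is mapped to ground predicates is outside the scope of this sketch.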