Large language models (LLMs) are often portrayed as merely imitating linguistic patterns without genuine understanding. We argue that recent findings in mechanistic interpretability (MI), the emerging field probing the inner workings of LLMs, render this picture increasingly untenable--but only once those findings are integrated within a theoretical account of understanding. We propose a tiered framework for thinking about understanding in LLMs and use it to synthesize the most relevant findings to date. The framework distinguishes three hierarchical varieties of understanding, each tied to a corresponding level of computational organization: conceptual understanding emerges when a model forms "features" as directions in latent space, learning connections between diverse manifestations of a single entity or property; state-of-the-world understanding emerges when a model learns contingent factual connections between features and dynamically tracks changes in the world; principled understanding emerges when a model ceases to rely on memorized facts and discovers a compact "circuit" connecting them. Across these tiers, MI uncovers internal organizations that can underwrite understanding-like unification. However, these organizations also diverge from human cognition in their parallel exploitation of heterogeneous mechanisms. Fusing philosophical theory with mechanistic evidence thus allows us to transcend binary debates over whether AI understands, paving the way for a comparative, mechanistically grounded epistemology that explores how AI understanding aligns with--and diverges from--our own.
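To make the first tier concrete, the following is a minimal, self-contained sketch of what "features as directions in latent space" means. All vectors here are synthetic stand-ins, not activations from a real LLM; the feature name and dimensionality are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden dimension (real models use thousands)

# An assumed feature direction, e.g. "is-a-capital-city", unit-normalized.
# In MI work, such directions are typically found by probing or
# dictionary-learning methods; here we just pick one at random.
feature = rng.normal(size=d)
feature /= np.linalg.norm(feature)

# Synthetic activations: one constructed to contain the feature,
# one constructed to be orthogonal to it.
contains = 2.5 * feature + 0.1 * rng.normal(size=d)
lacks = rng.normal(size=d)
lacks -= (lacks @ feature) * feature  # project the feature out

def feature_score(activation, direction):
    """Scalar projection of an activation onto a feature direction."""
    return float(activation @ direction)

print(feature_score(contains, feature))  # large positive score
print(feature_score(lacks, feature))     # near zero
```

On this picture, a model "has" the concept insofar as semantically diverse inputs (different surface forms of the same entity or property) all produce activations with a large component along the same direction.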