Mechanistic interpretability has identified functional subgraphs within large language models (LLMs), known as Transformer Circuits (TCs), that appear to implement specific algorithms. Yet we lack a formal, single-pass way to quantify when an active circuit is behaving coherently and is thus likely trustworthy. Building on prior systems-theoretic proposals, we specialize a sheaf-cohomology and causal-emergence perspective to TCs and introduce the Effective-Information Consistency Score (EICS). EICS combines (i) a normalized sheaf inconsistency computed from local Jacobians and activations with (ii) a Gaussian effective-information (EI) proxy for circuit-level causal emergence derived from the same forward state. The construction is white-box and single-pass, and it makes units explicit so that the score is dimensionless. We further provide practical guidance on score interpretation and computational overhead (with fast and exact modes), along with a toy sanity-check analysis. Empirical validation on LLM tasks is deferred.
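To make the two ingredients of the score concrete, the following is a minimal sketch, not the paper's definition: the function names, the edge-residual form of the sheaf inconsistency, the log-det form of the Gaussian EI proxy, the noise variance, and the tanh-based combination rule are all illustrative assumptions chosen only so that the result is dimensionless and computable in a single forward pass.

```python
import numpy as np

def sheaf_inconsistency(acts, jacs, edges):
    """Normalized residual of local linearizations along circuit edges.

    acts: dict node -> activation vector; jacs: dict (u, v) -> local Jacobian;
    edges: list of (u, v) pairs. All names are hypothetical, not from the paper.
    """
    num = sum(np.linalg.norm(jacs[(u, v)] @ acts[u] - acts[v]) for u, v in edges)
    den = sum(np.linalg.norm(acts[v]) for _, v in edges) + 1e-12
    return num / den  # dimensionless: activation units cancel

def gaussian_ei_proxy(J, noise_var=1e-2):
    """Stand-in EI proxy: log-det of the output covariance of a
    linear-Gaussian channel y = J x + noise (assumed form)."""
    d = J.shape[0]
    cov = J @ J.T + noise_var * np.eye(d)
    return 0.5 * np.linalg.slogdet(cov)[1]

def eics(acts, jacs, edges, J_circuit):
    """Assumed combination rule: EI squashed to (0, 1), discounted by
    sheaf inconsistency, so higher means more coherent."""
    s = sheaf_inconsistency(acts, jacs, edges)
    ei = gaussian_ei_proxy(J_circuit)
    return np.tanh(max(ei, 0.0)) / (1.0 + s)

# Toy sanity check: a perfectly consistent two-node circuit scores
# higher than one whose downstream activation contradicts the Jacobian.
x = np.array([1.0, 0.0])
consistent = eics({0: x, 1: x}, {(0, 1): np.eye(2)}, [(0, 1)], np.eye(2))
inconsistent = eics({0: x, 1: np.array([0.0, 1.0])},
                    {(0, 1): np.eye(2)}, [(0, 1)], np.eye(2))
print(consistent > inconsistent)
```

The key property the sketch preserves is the one claimed in the abstract: both terms are computed from the same forward-pass state (activations plus local Jacobians), and the ratio construction keeps the score unit-free.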