Towards Verifiable Transformers: Solver-Checkable Circuit Explanations

Mechanistic interpretability often identifies circuits inside Transformer models, but explanations of those circuits are usually validated through examples, ablations, and manual reasoning. This leaves a gap between finding a plausible circuit and proving what the circuit does. We introduce Verifiable Transformers, a framework for converting task-localized Transformer circuits into bounded, solver-checkable claims. Given a behavior, a finite task domain, and a candidate-token projection, we extract a task circuit and verify properties such as projected functional equivalence, edge necessity, task-relevant invariance, and final-residual robustness. Direct verification encodes the extracted circuit itself into an SMT solver. When a circuit contains operators that are not exactly or tractably encodable, surrogate-mediated verification fits an SMT-encodable surrogate, validates it against the extracted circuit over the bounded domain, and verifies symbolic explanations against the surrogate. We instantiate direct verification with a GPT-style architecture using Signed L1 BandNorm, sparsemax attention, and LeakyReLU. On small symbolic sequence tasks, we train an SMT-representable Transformer, extract sparse circuits for quote closing and bracket type tracking, and exhaustively verify projected functional equivalence, content invariance, edge necessity, and final-residual robustness. At GPT-2 scale, the same operator stack trains stably on OpenWebText, although naive direct SMT verification remains intractable. We also demonstrate surrogate-mediated verification on task-localized circuits with hard-to-encode attention, showing both verified symbolic explanations and solver-generated counterexamples. The goal is not full-model verification, but a concrete path for turning mechanistic circuit explanations into formal propositions that can be proven or refuted.

翻译：机制可解释性通常识别Transformer模型内部的电路，但这些电路的解释通常通过示例、消融和手动推理来验证。这导致在发现一个可能的电路与证明电路的功能之间存在差距。我们引入可验证Transformer（Verifiable Transformers），这是一个将任务局部化Transformer电路转换为有界、求解器可验证声明的框架。给定一个行为、一个有限任务域和一个候选令牌投影，我们提取一个任务电路并验证诸如投影功能等价性、边必要性、任务相关不变性和最终残差鲁棒性等属性。直接验证将提取电路本身编码到SMT求解器中。当电路包含无法精确或易处理编码的算子时，代理中介验证拟合一个SMT可编码的代理，在有限域上对提取的电路进行验证，然后针对代理验证符号解释。我们使用带有带符号L1 BandNorm、sparsemax注意力机制和LeakyReLU的GPT架构实例化直接验证。在小型符号序列任务上，我们训练一个SMT可表示的Transformer，为引号闭合和括号类型跟踪提取稀疏电路，并穷举验证投影功能等价性、内容不变性、边必要性和最终残差鲁棒性。在GPT-2尺度上，相同的算子栈在OpenWebText上稳定训练，尽管朴素的直接SMT验证仍然难以处理。我们还展示了在具有难编码注意力的任务局部化电路上的代理中介验证，展示了经过验证的符号解释和求解器生成的反例。目标不是全模型验证，而是为将机制电路解释转化为可证明或反驳的形式命题提供一条具体路径。