Extracting sparse circuits from billion-parameter transformers is constrained by $O(2^n)$ search cost and by pervasive feature reuse across co-active pathways. Hierarchical Attribution Graph Decomposition (HAGD) addresses this in four stages: cross-layer transcoder training, spectral coarsening of attribution graphs, graph-neural-network (GNN)-guided hierarchical traversal, and causal intervention verification. Together these reduce worst-case complexity to $O(n^2 \log n)$. Per-layer transcoders trained on the RedPajama corpus yield monosemantic dictionaries; gradient-activation products form weighted attribution graphs; normalized-Laplacian spectral clustering builds multi-resolution hierarchies; and an attention-based GNN assigns circuit-membership scores at successive coarsening stages. Evaluation spans GPT-2 (117M-774M), Pythia (1.4B-6.9B), and Llama (7B-70B) across modular arithmetic, parity computation, integer sorting, coreference resolution (WinoGrande), commonsense reasoning (HellaSwag), and factual recall. Behavioral preservation reaches 91\% ($\pm$2.3\%) on modular arithmetic with circuits of 49 to 347 nodes, whereas ACDC exhausts memory beyond 1.4B parameters. Cross-architecture transfer coefficients range from 0.38 to 0.82, with within-family pairs (Llama-7B $\to$ Llama-70B) attaining 0.82. Limitations include omitted attention-head circuits, 15-20\% unexplained reconstruction variance, circularity in ablation-based validation, and uncertain interpretability of circuits exceeding several hundred nodes.
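As a rough illustration of the graph-construction and spectral-coarsening steps named above, the sketch below builds a weighted graph from gradient-activation products and splits it once using the normalized Laplacian. It is not the authors' implementation: the function names, the symmetrization of edge weights, the dense `(n, n)` gradient input, and the sign-based split on the Fiedler vector are all assumptions made for illustration.

```python
import numpy as np


def attribution_graph(acts, grads):
    """Weighted adjacency from gradient-activation products (illustrative).

    acts:  (n,) feature activations (hypothetical input).
    grads: (n, n) array, grads[i, j] = gradient of feature j w.r.t. feature i.
    """
    w = np.abs(acts[:, None] * grads)  # edge weight |a_i * grad_ij|
    w = 0.5 * (w + w.T)                # assumption: symmetrize for spectral use
    np.fill_diagonal(w, 0.0)           # no self-loops
    return w


def normalized_laplacian(w):
    """L = I - D^{-1/2} W D^{-1/2}; isolated nodes get an all-zero row."""
    deg = w.sum(axis=1)
    connected = deg > 0
    inv_sqrt = np.zeros_like(deg, dtype=float)
    inv_sqrt[connected] = 1.0 / np.sqrt(deg[connected])
    scaled = inv_sqrt[:, None] * w * inv_sqrt[None, :]
    return np.diag(connected.astype(float)) - scaled


def spectral_bipartition(w):
    """Split nodes by the sign of the second-smallest eigenvector of L."""
    _, vecs = np.linalg.eigh(normalized_laplacian(w))  # ascending eigenvalues
    fiedler = vecs[:, 1]
    return (fiedler > 0).astype(int)


def two_triangle_example():
    """Two tightly coupled triangles joined by one weak edge (toy data)."""
    grads = np.zeros((6, 6))
    for block in (range(0, 3), range(3, 6)):
        for i in block:
            for j in block:
                if i != j:
                    grads[i, j] = 1.0
    grads[2, 3] = grads[3, 2] = 0.01   # weak cross-cluster attribution
    acts = np.ones(6)
    return attribution_graph(acts, grads)


if __name__ == "__main__":
    print(spectral_bipartition(two_triangle_example()))
```

On the toy graph, the two triangles land in different groups; which group is labelled 0 or 1 depends on the arbitrary sign of the eigenvector. Applying the split recursively to each group would give one possible multi-resolution hierarchy, though the abstract does not specify how HAGD constructs its levels.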