When Does Routing Become Interpretable? Causal Probes on Block Attention Residuals

Block Attention Residuals (Block AttnRes) replace fixed additive residuals with a learned softmax over earlier depth-source representations, surfacing cross-layer routing as an inspectable tensor in the forward pass. This is a tempting interpretability target: information flow that is normally inferred indirectly becomes directly observable. We ask whether such exposure suffices for mechanistic interpretation. We probe two same-scale ($0.6$B) Block AttnRes checkpoints under identical routing-ablation interventions: a vanilla Qwen3 wrapped at inference time with a deterministic recency-bias schedule, which the codebase admits as a routing-equivalent loading path, and a Block AttnRes Qwen3 trained from scratch with routing as part of optimisation. The wrapped baseline's routing weights are content-independent and reproduce the schedule's analytic prediction. The trained AttnRes checkpoint instead exhibits three localised routing motifs: an embedding-source pathway through early-layer MLP, a current-state pathway through early-layer attention and MLP, and an older-history pathway through late-layer attention. Beyond this stratification, we find a sharp dissociation between average routing mass and causal importance: in both sublayers, the slice with the largest mass is not the one with the largest causal contribution, and one source family carries appreciable mass with no detectable causal role under intervention. Architectural exposure of routing is therefore necessary but not sufficient for mechanistic interpretation: structured depth routing emerges only when routing has been part of training, and even then, descriptive routing summaries should be treated as candidate hypotheses to be tested by causal interventions, not as evidence of mechanism in their own right.
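To make the distinction between routing mass and causal importance concrete, the following is a minimal toy sketch, not the paper's implementation. It assumes a dot-product scoring rule with a hypothetical `query` vector, a hypothetical linear `readout` as the downstream quantity, and an ablation that zeroes one source's weight and renormalises the rest; none of these details are specified in the abstract.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def route(sources, query, ablate=None):
    """Mix earlier depth sources with softmax weights (assumed scoring rule).

    sources: list of d-dimensional vectors, one per earlier depth source.
    query:   hypothetical learned routing vector of dimension d.
    ablate:  optional source index whose weight is zeroed, with the
             remaining weights renormalised (an assumed intervention).
    """
    d = len(query)
    weights = softmax([dot(s, query) / math.sqrt(d) for s in sources])
    if ablate is not None:
        weights[ablate] = 0.0
        total = sum(weights)
        weights = [w / total for w in weights]
    mixed = [sum(w * s[i] for w, s in zip(weights, sources)) for i in range(d)]
    return mixed, weights

# Toy data: random sources, routing query, and downstream readout.
random.seed(0)
d, n_sources = 8, 4
sources = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_sources)]
query = [random.gauss(0, 1) for _ in range(d)]
readout = [random.gauss(0, 1) for _ in range(d)]

mixed, weights = route(sources, query)
baseline = dot(readout, mixed)

# Causal effect of each source: change in the readout when it is ablated.
effects = []
for k in range(n_sources):
    ablated, _ = route(sources, query, ablate=k)
    effects.append(abs(dot(readout, ablated) - baseline))

for k, (w, e) in enumerate(zip(weights, effects)):
    print(f"source {k}: mass={w:.3f}  ablation effect={e:.3f}")
```

In this toy setting the ranking of sources by mass need not match their ranking by ablation effect, because the effect also depends on how each source's content aligns with the readout. That is the sense in which a routing weight is a hypothesis about mechanism, to be checked by intervention.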
