DifFRACT: Diffusion Feature Reconstruction and Attribution for Circuit Tracing

Mechanistic interpretability seeks to explain neural network behavior by decomposing model computations into interpretable features and circuits. While transcoder-based circuit tracing has recently enabled detailed causal analyses of large language models, multimodal diffusion transformers for image generation remain comparatively opaque. We still lack tools for understanding how semantic information propagates across denoising steps and how text and image representations interact within double-stream MM-DiT architectures. Existing methods provide only partial insight: attention maps expose a limited view of token interactions, while sparse autoencoders can discover interpretable features but do not directly reveal how these features are transformed and composed through nonlinear MLP layers. In this work, we extend transcoder-based circuit tracing to multimodal diffusion transformers. We train timestep-conditioned transcoders that faithfully approximate the input-output behavior of MLP sublayers in FLUX.1[schnell]. By replacing MLPs with transcoders and linearizing the remaining computation, we obtain exact feature-to-feature attribution and recover compact, interpretable circuits. Empirically, our transcoders match or slightly outperform sparse autoencoders on the sparsity-faithfulness tradeoff. The resulting circuits reveal mechanisms underlying attribute binding and cross-stream semantic propagation, and provide causal explanations for systematic generation errors. Moreover, circuit-guided interventions are substantially more precise and effective than standard SAE-based steering. Our results demonstrate that transcoder-based circuit analysis is feasible for state-of-the-art diffusion transformers and provides a powerful framework for understanding and controlling multimodal generative models. The code is available at https://github.com/Artalmaz31/DifFRACT
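To make the attribution idea concrete, the following is a minimal sketch and not the released DifFRACT implementation. It assumes a toy transcoder with ReLU features, a hypothetical per-timestep additive bias (`t_bias`) as the conditioning mechanism, and a frozen linear map `J` standing in for the linearized computation between two MLP sublayers; all class, function, and parameter names are illustrative. Under these assumptions, the pre-activation of each downstream feature decomposes exactly into contributions from upstream features plus bias terms.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

class Transcoder:
    """Toy transcoder: sparse ReLU features predicting an MLP's output from its input."""
    def __init__(self, W_enc, b_enc, W_dec, b_dec, t_bias):
        self.W_enc = W_enc    # n_features x d_model: one encoder row per feature
        self.b_enc = b_enc
        self.W_dec = W_dec    # n_features x d_model: one decoder direction per feature
        self.b_dec = b_dec
        self.t_bias = t_bias  # ASSUMPTION: dict timestep -> additive pre-activation bias

    def pre_acts(self, x, t):
        proj = matvec(self.W_enc, x)
        return [p + b + tb for p, b, tb in zip(proj, self.b_enc, self.t_bias[t])]

    def encode(self, x, t):
        return [max(0.0, p) for p in self.pre_acts(x, t)]

    def decode(self, acts):
        out = list(self.b_dec)
        for a, d in zip(acts, self.W_dec):
            out = [o + a * di for o, di in zip(out, d)]
        return out

def attribution(up, down, up_acts, J):
    """A[j][i] = a_i * e_j . (J d_i): contribution of upstream feature i
    to the pre-activation of downstream feature j through the frozen map J."""
    return [[a * dot(e, matvec(J, d)) for a, d in zip(up_acts, up.W_dec)]
            for e in down.W_enc]

# Two tiny transcoders with d_model = 2 and two features each.
up = Transcoder([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0],
                [[1.0, 1.0], [0.0, 2.0]], [0.0, 0.0],
                {0: [0.0, 0.0], 1: [-0.5, 0.0]})
down = Transcoder([[1.0, -1.0], [0.5, 0.5]], [0.1, -0.2],
                  [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0],
                  {0: [0.0, 0.0], 1: [0.0, 0.0]})
J = [[1.0, 0.0], [0.0, 1.0]]  # ASSUMPTION: identity stands in for the linearized path

x, t = [2.0, 1.0], 1
up_acts = up.encode(x, t)
down_input = matvec(J, up.decode(up_acts))
down_pre = down.pre_acts(down_input, t)
A = attribution(up, down, up_acts, J)

# Exactness check: attributions plus bias terms reproduce each pre-activation.
for j, row in enumerate(A):
    bias = down.b_enc[j] + down.t_bias[t][j] + dot(down.W_enc[j], matvec(J, up.b_dec))
    assert abs(sum(row) + bias - down_pre[j]) < 1e-9
```

Because every step between the upstream feature activations and the downstream pre-activations is linear in this sketch, the decomposition holds exactly rather than approximately. In a real model, `J` would come from freezing attention patterns and normalization terms at their observed values, which is an assumption about how the linearization is done.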
