Mechanistic interpretability seeks to understand how Large Language Models (LLMs) represent and process information. Recent approaches based on dictionary learning and transcoders express model computation in terms of sparse, interpretable features and their interactions, giving rise to feature attribution graphs. However, these graphs are often large and redundant, which limits their interpretability in practice. Cross-Layer Transcoders (CLTs) address this issue by sharing features across layers while preserving layer-specific decoding, yielding more compact representations, but they remain difficult to train and analyze at scale. We introduce an open-source library for end-to-end training and interpretability of CLTs. Our framework integrates scalable distributed training with model sharding and compressed activation caching, a unified automated interpretability pipeline for feature analysis and explanation, attribution graph computation using Circuit-Tracer, and a flexible visualization interface. Together, these components provide a practical and unified solution for scaling CLT-based mechanistic interpretability. Our code is available at: https://github.com/LLM-Interp/CLT-Forge.
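To make "sharing features across layers while preserving layer-specific decoding" concrete, the following is a minimal NumPy sketch of a CLT forward pass. It is an illustration under stated assumptions, not the CLT-Forge API: the function names `init_clt` and `clt_forward`, the plain ReLU activation, the random initialization, and the dense `(source, target)` decoder tensor are all hypothetical choices made for brevity. The sketch assumes that features encoded at a given layer may write to the reconstruction at that layer and at every later layer, each through its own decoder matrix.

```python
import numpy as np


def init_clt(n_layers, d_model, n_features, seed=0):
    """Randomly initialize a toy cross-layer transcoder (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(d_model)
    # One encoder per layer: reads the residual stream at that layer.
    W_enc = rng.normal(0.0, scale, size=(n_layers, d_model, n_features))
    # One decoder per (source layer, target layer) pair.
    # Only pairs with target >= source are used in the forward pass.
    W_dec = rng.normal(0.0, scale, size=(n_layers, n_layers, n_features, d_model))
    return W_enc, W_dec


def clt_forward(resid, W_enc, W_dec):
    """Toy CLT forward pass for a single token position.

    resid: array of shape (n_layers, d_model), the input at each layer.
    Returns (features, recon) with shapes (n_layers, n_features) and
    (n_layers, d_model).
    """
    n_layers = resid.shape[0]
    # Encode each layer independently; ReLU is an assumed simplification.
    features = np.maximum(0.0, np.einsum("ld,ldf->lf", resid, W_enc))
    recon = np.zeros_like(resid)
    for target in range(n_layers):
        # The output at `target` sums contributions from features encoded
        # at this layer and all earlier layers, each via its own decoder.
        for source in range(target + 1):
            recon[target] += features[source] @ W_dec[source, target]
    return features, recon


if __name__ == "__main__":
    W_enc, W_dec = init_clt(n_layers=3, d_model=4, n_features=8)
    resid = np.random.default_rng(1).normal(size=(3, 4))
    features, recon = clt_forward(resid, W_enc, W_dec)
    print(features.shape, recon.shape)
```

Because a single feature can account for an effect that spans several layers, an attribution graph built on such features needs fewer nodes than one built from independent per-layer transcoders, which is the compactness the abstract refers to.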