The fields of explainable AI and mechanistic interpretability aim to uncover the internal structure of neural networks, with circuit discovery as a central tool for understanding model computations. Existing approaches, however, rely on manual inspection and remain limited to toy tasks. Automated interpretability offers scalability by analyzing isolated features and their activations, but it often misses interactions between features and depends heavily on external LLMs and on dataset quality. Transcoders have recently made it possible to separate feature attributions into input-dependent and input-invariant components, providing a foundation for more systematic circuit analysis. Building on this, we propose WeightLens and CircuitLens, two complementary methods that go beyond activation-based analysis. WeightLens interprets features directly from their learned weights, removing the need for explainer models and datasets while matching or exceeding the performance of existing methods on context-independent features. CircuitLens captures how feature activations arise from interactions between components, revealing circuit-level dynamics that activation-only approaches cannot identify. Together, these methods make interpretability more robust and support scalable mechanistic analysis of circuits without sacrificing efficiency or quality.