The complex and unpredictable nature of deep neural networks prevents their safe use in many high-stakes applications. Many techniques have been developed to interpret deep neural networks, but all have substantial limitations. Algorithmic tasks have proven to be a fruitful testing ground for interpreting a neural network end-to-end. Building on previous work, we completely reverse engineer one-hidden-layer fully connected networks that have ``grokked'' the arithmetic of the permutation groups $S_5$ and $S_6$. The models discover the true subgroup structure of the full group and converge on neural circuits that decompose the group arithmetic using the permutation group's subgroups. We describe how we reverse engineered the models' mechanisms and confirmed that our theory is a faithful description of the circuits' functionality. We also draw attention to current challenges in conducting interpretability research by comparing our work to Chughtai et al. [4], which claims to find a different algorithm for this same problem.
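To make the task concrete, the following is a minimal sketch of the $S_5$ multiplication problem and of one way a subgroup can simplify it. It is an illustration under stated assumptions, not the trained networks' circuit: the names `compose`, `multiplication_dataset`, and `left_coset` are hypothetical, and the choice of subgroup $H$ (the permutations fixing the last point, a copy of $S_4$) is ours. The sketch enumerates all pairs $(a, b)$ with their product, then checks the standard fact that the left coset of $ab$ depends on $b$ only through the left coset $bH$.

```python
from itertools import permutations

N = 5
# All 120 elements of S_5; each is a tuple p where p[i] is the image of i.
ELEMENTS = list(permutations(range(N)))
INDEX = {p: i for i, p in enumerate(ELEMENTS)}


def compose(a, b):
    """Return the permutation a∘b (apply b first, then a)."""
    return tuple(a[b[i]] for i in range(len(a)))


def multiplication_dataset():
    """All (index of a, index of b, index of a∘b) triples for S_5."""
    return [(INDEX[a], INDEX[b], INDEX[compose(a, b)])
            for a in ELEMENTS for b in ELEMENTS]


# Assumed subgroup for illustration: permutations fixing the last point.
H = [p for p in ELEMENTS if p[N - 1] == N - 1]


def left_coset(g):
    """The left coset gH as a frozenset, so cosets can be compared."""
    return frozenset(compose(g, h) for h in H)


data = multiplication_dataset()
COSET_OF = {g: left_coset(g) for g in ELEMENTS}
cosets = set(COSET_OF.values())


def coset_of_product_is_well_defined():
    """Check that the coset of a∘b is fixed by a and the coset of b."""
    for a in ELEMENTS:
        seen = {}
        for b in ELEMENTS:
            product_coset = COSET_OF[compose(a, b)]
            if seen.setdefault(COSET_OF[b], product_coset) != product_coset:
                return False
    return True


print(len(data), len(H), len(cosets))  # 14400 24 5
print(coset_of_product_is_well_defined())  # True
```

Under these assumptions, knowing $a$ and which of the 5 cosets contains $b$ narrows the product from 120 candidates to 24. This is the kind of structure that subgroups make available, and it says nothing on its own about how a trained model uses it.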