We provide evidence that eigenanalysis of the empirical neural tangent kernel (eNTK) can surface the features used by trained neural networks. We study three standard toy models for mechanistic interpretability: Toy Models of Superposition (TMS), a 1-layer MLP trained on modular addition, and a 1-layer Transformer trained on modular addition. In all three, we find that the top eigenspaces of the eNTK align with ground-truth features. In TMS, the eNTK recovers the ground-truth features in both the sparse (high-superposition) and dense regimes. In modular arithmetic, the eNTK can be used to recover Fourier feature families. Moreover, we provide evidence that a layerwise eNTK localizes features to specific layers, and that the evolution of the eNTK spectrum can be used to diagnose the grokking phase transition. These results suggest that eNTK analysis may provide a practical handle for feature discovery and for detecting phase changes in small models.
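To make the core quantity concrete, here is a minimal sketch of eNTK eigenanalysis. It uses a hypothetical tiny 1-hidden-layer MLP (not any of the paper's actual models or hyperparameters): per-example parameter gradients are stacked into a Jacobian, the eNTK Gram matrix is formed, and its top eigenvectors give candidate feature directions over the evaluation inputs.

```python
import numpy as np

# Assumed toy setup for illustration only: a 1-hidden-layer MLP with
# scalar output, f(x) = w2 . tanh(W1 @ x), evaluated on n random inputs.
rng = np.random.default_rng(0)
d_in, d_hid, n = 4, 8, 16
W1 = rng.normal(size=(d_hid, d_in)) / np.sqrt(d_in)
w2 = rng.normal(size=d_hid) / np.sqrt(d_hid)
X = rng.normal(size=(n, d_in))

def param_gradient(x):
    # Gradient of f(x) w.r.t. all parameters, flattened into one vector.
    h = np.tanh(W1 @ x)
    dW1 = np.outer(w2 * (1.0 - h**2), x)  # df/dW1 via chain rule
    dw2 = h                                # df/dw2
    return np.concatenate([dW1.ravel(), dw2])

# eNTK Gram matrix over the data: K[i, j] = <grad f(x_i), grad f(x_j)>.
J = np.stack([param_gradient(x) for x in X])
K = J @ J.T

# Eigendecomposition; sort descending so the leading eigenvectors
# (candidate feature directions in function space) come first.
evals, evecs = np.linalg.eigh(K)
evals, evecs = evals[::-1], evecs[:, ::-1]
```

In practice one would compute `J` with an autodiff framework rather than by hand, and inspect how the top eigenvectors of `K` project onto known feature families (e.g. Fourier modes for modular addition).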