We provide evidence that eigenanalysis of the empirical neural tangent kernel (eNTK) can surface feature directions in trained neural networks. Across three increasingly realistic settings (a 1-layer MLP trained on modular addition, a 1-layer Transformer trained on modular addition, and the pretrained language model Gemma-3-270M), we show that the top eigenspaces of the eNTK align with ground-truth or interpretable features. In the modular arithmetic examples, the top eNTK eigenspaces align with the Fourier features used by the MLP, and with the Fourier features at seed-dependent frequencies used by the Transformer, to implement known ground-truth algorithms. Moreover, the alignment of the relevant subspaces evolves over training, with its first derivative peaking near the onset of grokking. For Gemma-3-270M, we compute the top eNTK eigendirections on a dataset of TinyStories context windows and check their alignment with an automatically generated set of part-of-speech and other grammatical feature directions. We find that eNTK eigendirections align with grammar features better than a same-budget baseline of PCA on model activations. These results suggest that eNTK eigenanalysis may provide a new handle on identifying features in trained models for mechanistic interpretability.
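To make the eigenanalysis above concrete, here is a minimal NumPy sketch, not the procedure used in this work. It assumes a scalar-output one-hidden-layer tanh MLP with random, untrained weights, builds the eNTK as the Gram matrix of per-example parameter gradients, and scores a top eigenspace against a reference subspace by mean squared projection. The function names, the toy model, the alignment score, and the random reference features are all illustrative assumptions.

```python
import numpy as np

def mlp_param_gradients(W, a, X):
    """Per-example gradients of f(x) = a . tanh(W x) w.r.t. all parameters.

    Returns an (n_examples, n_params) Jacobian, with the entries of W
    (row-major) first and the entries of a last.
    """
    H = np.tanh(X @ W.T)                         # (n, h) hidden activations
    grad_a = H                                   # df/da = tanh(W x)
    delta = a * (1.0 - H ** 2)                   # (n, h) df/d(pre-activation)
    grad_W = delta[:, :, None] * X[:, None, :]   # (n, h, d) df/dW
    return np.concatenate([grad_W.reshape(len(X), -1), grad_a], axis=1)

def entk_top_eigenspace(J, k):
    """eNTK Gram matrix K = J J^T and its top-k eigenpairs (descending)."""
    K = J @ J.T                                  # (n, n), symmetric PSD
    evals, evecs = np.linalg.eigh(K)             # eigh returns ascending order
    order = np.argsort(evals)[::-1][:k]
    return K, evals[order], evecs[:, order]

def subspace_alignment(U, F):
    """Mean squared projection of the orthonormal columns of U onto span(F).

    Illustrative score in [0, 1]; the metric used in the paper may differ.
    """
    Q, _ = np.linalg.qr(F)                       # orthonormal basis of span(F)
    return float(np.sum((Q.T @ U) ** 2) / U.shape[1])

# Toy demo with random (untrained) weights and placeholder reference features.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 5))                     # 32 inputs of dimension 5
W = rng.normal(size=(16, 5))                     # hidden-layer weights
a = rng.normal(size=16)                          # readout weights

J = mlp_param_gradients(W, a, X)
K, top_evals, top_evecs = entk_top_eigenspace(J, k=4)
F = rng.normal(size=(32, 4))                     # hypothetical feature directions
print("top eigenvalues:", top_evals)
print("alignment with placeholder features:", subspace_alignment(top_evecs, F))
```

In a realistic setting the gradients would come from automatic differentiation of a trained model, and the reference directions would be the Fourier or grammatical features described above rather than random vectors.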