Neural networks represent more features than they have dimensions via superposition, forcing features to share representational space. Current methods decompose activations into sparse linear features but discard geometric structure. We develop a theory for studying the geometric structure of features by analyzing the spectra (eigenvalues, eigenspaces, etc.) of weight-derived matrices. In particular, we introduce the frame operator $F = WW^\top$, which yields a spectral measure describing how each feature allocates its norm across eigenspaces. Whereas previous tools describe pairwise interactions between features, spectral methods capture the global geometry (``how do all features interact?''). In toy models of superposition, we use this theory to prove that capacity saturation forces spectral localization: features collapse onto single eigenspaces, organize into tight frames, and admit a discrete classification via association schemes, recovering all geometries from prior work (simplices, polygons, antiprisms). The spectral measure formalism applies to arbitrary weight matrices, enabling diagnosis of feature localization beyond toy settings. These results point toward a broader program: applying operator theory to interpretability.
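To make the formalism concrete, here is a minimal sketch of how the frame operator and per-feature spectral measure could be computed from a weight matrix. It assumes the columns of $W \in \mathbb{R}^{d \times n}$ are the $n$ feature vectors in $d$-dimensional activation space, and that the spectral measure of feature $w_i$ assigns mass $\|P_\lambda w_i\|^2 / \|w_i\|^2$ to each eigenvalue $\lambda$ of $F$ (with $P_\lambda$ the orthogonal projector onto the corresponding eigenspace); this formalization, the helper name `feature_spectral_measure`, and the tolerance-based eigenvalue grouping are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def feature_spectral_measure(W, tol=1e-8):
    """Per-feature spectral measure of the frame operator F = W @ W.T.

    Assumes the columns of W (shape d x n) are feature vectors.
    Returns the distinct eigenvalues of F and an (n x k) matrix whose
    row i gives the fraction of feature i's squared norm lying in each
    of the k eigenspaces. Each row sums to 1.
    """
    F = W @ W.T                              # frame operator, d x d, symmetric PSD
    eigvals, eigvecs = np.linalg.eigh(F)     # eigenvalues in ascending order

    # Group numerically equal eigenvalues into shared eigenspaces.
    distinct, groups = [], []
    for j, lam in enumerate(eigvals):
        if distinct and abs(lam - distinct[-1]) < tol:
            groups[-1].append(j)
        else:
            distinct.append(lam)
            groups.append([j])

    n = W.shape[1]
    norms = np.maximum(np.sum(W**2, axis=0), 1e-30)  # guard zero-norm features
    measure = np.zeros((n, len(groups)))
    for g, idx in enumerate(groups):
        P = eigvecs[:, idx]                  # orthonormal basis of this eigenspace
        proj = P.T @ W                       # eigenspace components of each feature
        measure[:, g] = np.sum(proj**2, axis=0) / norms
    return np.array(distinct), measure

# Example: 3 unit features at 120-degree angles in R^2 (a tight frame).
theta = 2 * np.pi * np.arange(3) / 3
W = np.stack([np.cos(theta), np.sin(theta)])  # shape (2, 3)
lams, mu = feature_spectral_measure(W)
# F = (3/2) I: a single eigenspace, so every feature is fully localized
# and every row of `mu` is [1.0].
```

Under these assumptions, a feature is spectrally localized exactly when its row of `mu` concentrates on a single eigenspace, which is the property the abstract's capacity-saturation result forces in toy models.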