We frame the problem of unifying representations in neural models as one of sparse model recovery and introduce a framework that extends sparse autoencoders (SAEs) to lifted spaces and infinite-dimensional function spaces, enabling mechanistic interpretability of large neural operators (NOs). While the Platonic Representation Hypothesis suggests that neural networks converge to similar representations across architectures, the representational properties of neural operators remain underexplored despite their growing importance in scientific computing. We compare the inference and training dynamics of SAEs, lifted SAEs, and SAE neural operators. We highlight how lifting and operator modules introduce beneficial inductive biases: faster recovery during training, improved recovery of smooth concepts, and robust inference across varying input resolutions, a property unique to neural operators.
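To make the comparison concrete, below is a minimal sketch of a standard SAE and a hypothetical lifted variant. This is not the authors' implementation: it assumes the usual SAE formulation (a ReLU encoder, a linear decoder, and an L1 sparsity penalty) and illustrates "lifting" as a pointwise channel-expansion layer of the kind used in neural operators, so the same dictionary weights apply at every grid point regardless of resolution. All class and function names here are illustrative.

```python
# Minimal sketch (not the paper's code): a standard SAE and a hypothetical
# "lifted" SAE that shares weights pointwise over a discretized function.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Standard SAE: x_hat = W_d ReLU(W_e x + b_e) + b_d."""

    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.encoder(x))   # sparse codes
        x_hat = self.decoder(z)           # reconstruction
        return x_hat, z


class LiftedSAE(nn.Module):
    """Hypothetical lifted SAE: lift input channels to a higher-dimensional
    space pointwise (as in neural-operator lifting layers), then apply the
    SAE at every grid point, so the weights are shared across resolutions."""

    def __init__(self, d_in: int, d_lift: int, d_dict: int):
        super().__init__()
        self.lift = nn.Linear(d_in, d_lift)  # pointwise lifting
        self.sae = SparseAutoencoder(d_lift, d_dict)

    def forward(self, u: torch.Tensor):
        # u: (batch, n_points, d_in). Any grid size n_points works, because
        # every layer acts pointwise along the grid dimension.
        v = self.lift(u)
        v_hat, z = self.sae(v)
        return v_hat, z


def sae_loss(x, x_hat, z, l1_coeff: float = 1e-3):
    """The usual SAE objective: reconstruction error plus L1 sparsity."""
    return torch.mean((x - x_hat) ** 2) + l1_coeff * z.abs().mean()
```

Because every layer in this sketch acts pointwise along the grid dimension, the same trained model can be evaluated on inputs discretized at different resolutions, which is the sense of resolution-robust inference described above.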