We introduce sparse autoencoder neural operators (SAE-NOs), a new class of sparse autoencoders that operate directly in infinite-dimensional function spaces. We generalize the linear representation hypothesis to a functional representation hypothesis, enabling concept learning beyond vector-valued representations. Unlike standard SAEs built from multi-layer perceptrons (SAE-MLPs), which assign each concept a scalar activation, SAE-NOs extend vector-valued representations to functional ones. We instantiate this framework as SAE Fourier neural operators (SAE-FNOs), parameterizing concepts as integral operators in the Fourier domain. We show that this functional parameterization fundamentally shapes the learned concepts, yielding improved stability with respect to sparsity level, robustness to distribution shifts, and generalization across discretizations. SAE-FNOs use concepts more efficiently across the data population and are more effective at extracting localized patterns from data. We further show that convolutional SAEs (SAE-CNNs) fail to generalize their sparse representations to unseen input resolutions, whereas SAE-FNOs operate across resolutions and reliably recover the underlying representations. Our results demonstrate that moving from fixed-dimensional to functional representations extends sparse autoencoders from detectors of concept presence to models that capture the underlying structure of the data, highlighting parameterization as a central driver of interpretability and generalization.
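The resolution-invariance claim can be illustrated with a minimal sketch: encoding a function into a sparse set of low-frequency Fourier coefficients and decoding it on a finer grid. This is an assumption-laden toy, not the paper's SAE-FNO architecture; the function names (`spectral_encode`, `spectral_decode`) and the TopK-style truncation are illustrative choices.

```python
import numpy as np

def spectral_encode(u, n_modes, k):
    """Encode a function sampled on a uniform grid as a sparse set of
    low-frequency Fourier coefficients (a toy stand-in for a learned
    integral-operator encoder)."""
    coeffs = np.fft.rfft(u) / len(u)     # resolution-normalized spectrum
    coeffs = coeffs[:n_modes]            # truncate to the lowest modes
    # TopK-style sparsity: zero all but the k largest-magnitude coefficients
    coeffs[np.argsort(np.abs(coeffs))[:-k]] = 0.0
    return coeffs

def spectral_decode(coeffs, n_points):
    """Reconstruct the function at an arbitrary resolution n_points:
    the code itself is resolution-free, so decoding at a new grid size
    requires no retraining in this parameterization."""
    full = np.zeros(n_points // 2 + 1, dtype=complex)
    full[:len(coeffs)] = coeffs
    return np.fft.irfft(full, n=n_points) * n_points

# A band-limited test function sampled at two resolutions
x_lo = np.linspace(0, 1, 64, endpoint=False)
x_hi = np.linspace(0, 1, 256, endpoint=False)
f = lambda x: np.sin(2 * np.pi * x) + 0.5 * np.cos(2 * np.pi * 3 * x)

code = spectral_encode(f(x_lo), n_modes=8, k=2)  # sparse functional code
recon_hi = spectral_decode(code, 256)            # decode at 4x the resolution
err = np.max(np.abs(recon_hi - f(x_hi)))
```

Because the sparse code lives in function space (Fourier coefficients) rather than being tied to a 64-point grid, the same code reconstructs the signal on a 256-point grid, which is the behavior the abstract attributes to SAE-FNOs and denies to SAE-CNNs.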