Rational Sparse Autoencoder

Sparse autoencoders (SAEs) are standard tools for mechanistic interpretability, but current SAE families are constrained by fixed encoder nonlinearities such as ReLU, JumpReLU, and TopK. A fixed nonlinearity hard-codes a particular sparsity mechanism into the model and can distort the reconstruction-versus-sparsity trade-off. We introduce the Rational Sparse Autoencoder (RSAE), which replaces the fixed encoder activation with a trainable rational function. Rational activations are flexible enough to uniformly approximate the activation primitives used by existing SAE families on compact domains (for TopK, the thresholded gate obtained once a separating top-k threshold is supplied), while also providing a richer function class for adapting to the observed pre-activation geometry. We realise this idea through a two-stage pipeline. The first stage is an initialisation procedure that copies the weights of a pre-trained baseline SAE, plugs in rational coefficients obtained by the relaxed Remez exchange on synthetic data, and calibrates the scale parameters along with the rational coefficients. The second stage is a fine-tuning step under the standard sparsity-regularised reconstruction objective. Empirically, on residual-stream activations of three open-weight language models and for all three baseline activation families, the fine-tuned RSAE strictly improves on the corresponding baseline SAE, both on reconstruction-side metrics and on downstream-behaviour metrics, without sacrificing feature-level interpretability under sparse probing. These gains are consistent across host language models, across baseline activation families, and across the full range of baseline sparsity levels we tested, while the upgrade itself adds only a handful of scalar parameters per autoencoder and runs in minutes on a single consumer GPU.
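To make the central idea concrete, the following is a minimal NumPy sketch of an SAE forward pass whose encoder nonlinearity is a rational function R(x) = P(x) / Q(x). It is an illustration under stated assumptions, not the authors' implementation. The function names, the pole-free denominator Q(x) = 1 + |q_1 x + q_2 x^2 + ...|, the single scalar `scale`, and the convention of subtracting the decoder bias before encoding are all hypothetical choices; the abstract does not specify the parameterisation.

```python
import numpy as np


def rational_activation(x, p, q):
    """Elementwise rational function R(x) = P(x) / Q(x).

    p: numerator coefficients in ascending order (p[0] + p[1]*x + ...).
    q: denominator coefficients for x^1, x^2, ... (ascending order).
    Assumption: Q(x) = 1 + |q[0]*x + q[1]*x^2 + ...| keeps Q >= 1,
    so R has no poles. The source does not state its parameterisation.
    """
    # np.polyval expects the highest-degree coefficient first, so reverse.
    numerator = np.polyval(p[::-1], x)
    denominator = 1.0 + np.abs(x * np.polyval(q[::-1], x))
    return numerator / denominator


def rsae_forward(x, W_enc, b_enc, W_dec, b_dec, p, q, scale=1.0):
    """Hypothetical RSAE forward pass: linear encoder, rational
    activation on rescaled pre-activations, linear decoder."""
    # Assumed convention: centre the input by the decoder bias.
    pre = (x - b_dec) @ W_enc + b_enc
    # `scale` stands in for the calibrated scale parameters.
    z = rational_activation(pre / scale, p, q)
    x_hat = z @ W_dec + b_dec
    return z, x_hat


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model, d_sae = 8, 16
    x = rng.normal(size=(4, d_model))
    W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
    W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)
    b_enc, b_dec = np.zeros(d_sae), np.zeros(d_model)
    p = np.array([0.0, 0.5, 0.5])  # illustrative coefficients only
    q = np.array([0.1])
    z, x_hat = rsae_forward(x, W_enc, b_enc, W_dec, b_dec, p, q)
    print(z.shape, x_hat.shape)
```

In this sketch the only parameters beyond those of a standard SAE are `p`, `q`, and `scale`, which matches the abstract's remark that the upgrade adds a handful of scalars per autoencoder. The coefficients in the demo are arbitrary placeholders; in the described pipeline they would come from the Remez-based initialisation and subsequent fine-tuning.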
