The linear representation hypothesis states that neural network activations encode high-level concepts as linear mixtures. However, under superposition, this encoding is a projection from a higher-dimensional concept space into a lower-dimensional activation space, and a linear decision boundary in the concept space need not remain linear after projection. In this setting, classical sparse coding methods with per-sample iterative inference leverage compressed sensing guarantees to recover latent factors. Sparse autoencoders (SAEs), on the other hand, amortise sparse inference into a fixed encoder, introducing a systematic gap. We show this amortisation gap persists across training set sizes, latent dimensions, and sparsity levels, causing SAEs to fail under out-of-distribution (OOD) compositional shifts. Through controlled experiments that decompose the failure, we identify dictionary learning -- not the inference procedure -- as the binding constraint: SAE-learned dictionaries point in substantially wrong directions, and replacing the encoder with per-sample FISTA on the same dictionary does not close the gap. An oracle baseline proves the problem is solvable with a good dictionary at all scales tested. Our results reframe the SAE failure as a dictionary learning challenge, not an amortisation problem, and point to scalable dictionary learning as the key open problem for sparse inference under superposition.
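As a minimal sketch of the distinction drawn above between per-sample iterative inference and amortised encoding, the following NumPy example runs FISTA on a fixed dictionary and compares it with a one-shot encoder. Everything specific here is an assumption made for illustration rather than the setup used in the experiments: the dictionary `D` is a random unit-norm matrix (not an SAE-learned or oracle dictionary), the amortised encoder is a hypothetical tied-weight ReLU map, and the dimensions, sparsity level `k`, penalty `lam`, and iteration count are arbitrary choices.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def lasso_objective(D, x, z, lam):
    """0.5 * ||x - D z||_2^2 + lam * ||z||_1 for a single sample."""
    r = x - D @ z
    return 0.5 * float(r @ r) + lam * float(np.abs(z).sum())


def fista(D, x, lam=0.05, n_iter=500):
    """Per-sample sparse inference: minimise the lasso objective with FISTA."""
    L = np.linalg.norm(D, 2) ** 2  # Lipschitz constant of the smooth term's gradient
    z = np.zeros(D.shape[1])
    y = z.copy()
    t = 1.0
    for _ in range(n_iter):
        grad = D.T @ (D @ y - x)                          # gradient at the extrapolated point
        z_next = soft_threshold(y - grad / L, lam / L)    # proximal gradient step
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = z_next + ((t - 1.0) / t_next) * (z_next - z)  # momentum extrapolation
        z, t = z_next, t_next
    return z


def amortised_encode(D, x, bias=0.0):
    """Hypothetical one-shot encoder, ReLU(D^T x - bias), with weights tied to D."""
    return np.maximum(D.T @ x - bias, 0.0)


# Toy superposition setting (assumed sizes): 64 concepts in 32 activation dimensions.
rng = np.random.default_rng(0)
n_act, n_concepts, k, lam = 32, 64, 3, 0.05

D = rng.standard_normal((n_act, n_concepts))
D /= np.linalg.norm(D, axis=0, keepdims=True)  # unit-norm dictionary atoms

z_true = np.zeros(n_concepts)
support = rng.choice(n_concepts, size=k, replace=False)
z_true[support] = rng.uniform(1.0, 2.0, size=k)  # k active, non-negative concepts
x = D @ z_true                                    # observed activation vector

z_fista = fista(D, x, lam=lam)
z_amort = amortised_encode(D, x)

for name, z in [("FISTA", z_fista), ("amortised", z_amort)]:
    n_active = int(np.sum(np.abs(z) > 1e-6))
    recon_err = float(np.linalg.norm(x - D @ z))
    print(f"{name}: active codes = {n_active}, reconstruction error = {recon_err:.4f}")
```

Swapping `D` for a learned dictionary while keeping `fista` unchanged is the kind of substitution the decomposition above describes: it holds the inference procedure fixed so that any remaining error can be attributed to the dictionary.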