Sparse Autoencoders (SAEs) have emerged as a useful tool for interpreting the internal representations of neural networks. However, naively optimising SAEs for reconstruction loss and sparsity results in a preference for SAEs that are extremely wide and sparse. We present an information-theoretic framework that casts SAEs as lossy compression algorithms for communicating explanations of neural activations. We appeal to the Minimum Description Length (MDL) principle to motivate explanations of activations that are both accurate and concise. We further argue that interpretable SAEs require an additional property, "independent additivity": features should be understandable independently of one another, so that the description of an activation decomposes into a sum of per-feature descriptions. We demonstrate the framework by training SAEs on MNIST handwritten digits and find that SAEs whose features represent significant line segments are optimal under MDL, rather than SAEs whose features memorise whole digits from the dataset or capture only small digit fragments. We argue that optimising for MDL rather than sparsity alone may avoid pitfalls of naive sparsity maximisation, such as undesirable feature splitting, and that this framework naturally suggests new hierarchical SAE architectures which provide more concise explanations.
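To make the compression framing concrete, the sketch below gives a minimal MDL-style bit accounting for SAE codes. It is an illustration under stated assumptions, not the paper's implementation: the cost model (index bits plus fixed-precision coefficient bits per active feature, plus an amortised charge for the dictionary), the 8-bit coefficient width, and all function names are illustrative choices. Under independent additivity, each active feature contributes its own bits, and charging for the dictionary itself is what keeps "wider and sparser" from always winning.

```python
import numpy as np


def code_bits_per_example(codes: np.ndarray, float_bits: int = 8) -> float:
    """Mean bits/example to transmit sparse SAE codes.

    Assumes independent additivity: each active feature is sent
    separately, costing log2(dict_size) bits to identify which
    feature fired plus `float_bits` bits for its coefficient.
    `codes` has shape (n_examples, dict_size); zeros are inactive.
    """
    _, dict_size = codes.shape
    l0 = (codes != 0).sum(axis=1)  # number of active features per example
    return float((l0 * (np.log2(dict_size) + float_bits)).mean())


def total_bits_per_example(codes: np.ndarray, input_dim: int,
                           n_examples: int, float_bits: int = 8) -> float:
    """Code cost plus the amortised cost of describing the dictionary.

    The dictionary term grows linearly with width, so under this
    accounting an extremely wide, extremely sparse SAE is not free.
    """
    _, dict_size = codes.shape
    dict_bits = dict_size * input_dim * float_bits  # one-off dictionary cost
    return code_bits_per_example(codes, float_bits) + dict_bits / n_examples


# Toy comparison: wide-and-sparse vs. narrow-and-denser codes (random values).
rng = np.random.default_rng(0)
n, input_dim = 1_000, 784  # MNIST-sized inputs
wide = rng.standard_normal((n, 16_384)) * (rng.random((n, 16_384)) < 4 / 16_384)
narrow = rng.standard_normal((n, 512)) * (rng.random((n, 512)) < 16 / 512)

print(f"wide  : {total_bits_per_example(wide, input_dim, n):,.0f} bits/example")
print(f"narrow: {total_bits_per_example(narrow, input_dim, n):,.0f} bits/example")
```

In this toy setting the narrower SAE wins despite its higher L0, because the wide dictionary's description cost dominates at this dataset size; sparsity alone would have preferred the wide SAE.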