Multimodal Large Language Models (MLLMs) have demonstrated strong performance on vision-language tasks, yet their internal visual representations remain difficult to interpret. Sparse Autoencoders (SAEs) provide a scalable way to decompose dense model activations into sparse, interpretable features. However, existing SAE architectures primarily recover flat feature dictionaries and are poorly suited to explicit multi-level concept organization. In this paper, we introduce cascaded sparse autoencoders (CSAEs) for learning hierarchical visual concepts in MLLMs. Rather than nesting or stacking SAEs on sparse activation codes, CSAEs train a second-level SAE directly on the decoder weights of the first-level SAE, treating the learned low-level feature directions as inputs for higher-level abstraction. This design enables CSAEs to learn "concepts of concepts" while avoiding both the shared-prefix coupling of nested, Matryoshka-style hierarchies and the bottlenecks of naively stacked SAEs. Experiments across Qwen3-VL, Gemma-3, and LLaVA on multiple visual datasets show that CSAEs improve interpretability, measured by hierarchical concept coherence, over state-of-the-art SAE baselines. Results on concept steering further demonstrate that the learned concept groups support effective group-level interventions in MLLM outputs.
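The cascade described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain ReLU SAE with an L1 sparsity penalty trained by full-batch gradient descent in NumPy, uses random data as a stand-in for MLLM visual activations, and the names `train_sae`, `encode`, and the argmax grouping step are hypothetical choices made for illustration only. The one element taken from the abstract is that the second-level SAE is trained on the rows of the first-level decoder matrix rather than on sparse activation codes.

```python
import numpy as np

def train_sae(X, n_features, l1=1e-3, lr=1e-2, steps=200, seed=0):
    """Train a minimal ReLU sparse autoencoder on the rows of X.

    Assumed objective (illustrative): mean squared reconstruction error
    plus an L1 penalty on the codes. Returns (W_enc, b_enc, W_dec), where
    W_dec has shape (n_features, d) and each row is one feature direction.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_dec = rng.normal(size=(n_features, d))
    W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)
    W_enc = W_dec.T.copy()              # (d, n_features), tied at init only
    b_enc = np.zeros(n_features)
    for _ in range(steps):
        pre = X @ W_enc + b_enc         # pre-activations, (n, n_features)
        z = np.maximum(pre, 0.0)        # sparse codes
        err = z @ W_dec - X             # reconstruction error, (n, d)
        g_xhat = 2.0 * err / n
        g_W_dec = z.T @ g_xhat
        # z is non-negative, so sign(z) is the subgradient of the L1 term
        g_z = g_xhat @ W_dec.T + l1 * np.sign(z) / n
        g_pre = g_z * (pre > 0)         # backprop through ReLU
        W_enc -= lr * (X.T @ g_pre)
        b_enc -= lr * g_pre.sum(axis=0)
        W_dec -= lr * g_W_dec
        # keep decoder rows at unit norm so they act as directions
        W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True) + 1e-8
    return W_enc, b_enc, W_dec

def encode(X, W_enc, b_enc):
    """Sparse codes for the rows of X."""
    return np.maximum(X @ W_enc + b_enc, 0.0)

# Stand-in for visual activations (assumption: 256 samples, 16 dims).
rng = np.random.default_rng(0)
acts = rng.normal(size=(256, 16))

# Level 1: SAE on activations yields 32 low-level feature directions.
W_enc1, b_enc1, W_dec1 = train_sae(acts, n_features=32)

# Level 2: SAE trained on the level-1 decoder rows, not on level-1 codes.
W_enc2, b_enc2, W_dec2 = train_sae(W_dec1, n_features=8)

# Each low-level feature gets a code over the 8 high-level features.
codes2 = encode(W_dec1, W_enc2, b_enc2)     # (32, 8)

# Hypothetical grouping rule: strongest level-2 feature per level-1 feature
# (a row with an all-zero code falls into group 0 under argmax).
groups = codes2.argmax(axis=1)              # (32,)
```

In this sketch, `groups` plays the role of a concept group assignment: low-level features sharing a label could be scaled or ablated together, which is the kind of group-level intervention the abstract refers to. How the paper actually forms and steers groups is not specified here.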