Sparse autoencoders (SAEs) have proven effective for extracting monosemantic features from large language models (LLMs), yet these features are typically identified in isolation. Broad evidence suggests, however, that LLMs capture the intrinsic structure of natural language, and the phenomenon of "feature splitting" in particular indicates that this structure is hierarchical. To capture this structure, we propose the Hierarchical Sparse Autoencoder (HSAE), which jointly learns a series of SAEs and the parent-child relationships between their features. HSAE strengthens the alignment between parent and child features through two novel mechanisms: a structural constraint loss and random feature perturbation. Extensive experiments across various LLMs and layers demonstrate that HSAE consistently recovers semantically meaningful hierarchies, supported by both qualitative case studies and rigorous quantitative metrics, while preserving the reconstruction fidelity and interpretability of standard SAEs across different dictionary sizes. Our work provides a powerful, scalable tool for discovering and analyzing the multi-scale conceptual structures embedded in LLM representations.
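To make the two mechanisms named above concrete, the following is a minimal, hypothetical sketch of a two-level hierarchical SAE in PyTorch. The module names, the soft child-to-parent assignment matrix, the specific form of the structural constraint (penalizing a child feature that fires while its assigned parent is inactive), the perturbation scheme, and all loss weights are illustrative assumptions, not the paper's actual architecture or training objective.

```python
# Hypothetical sketch of a two-level hierarchical SAE with a structural
# constraint between parent and child features. All names, the form of the
# constraint, and the loss weights are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLevelHSAE(nn.Module):
    def __init__(self, d_model: int, n_parent: int, n_child: int):
        super().__init__()
        # Parent SAE (coarse dictionary) and child SAE (fine dictionary).
        self.enc_p = nn.Linear(d_model, n_parent)
        self.dec_p = nn.Linear(n_parent, d_model, bias=False)
        self.enc_c = nn.Linear(d_model, n_child)
        self.dec_c = nn.Linear(n_child, d_model, bias=False)
        # Learnable soft assignment of each child feature to a parent feature
        # (assumed parameterization of the parent-child relationships).
        self.assign_logits = nn.Parameter(torch.zeros(n_child, n_parent))

    def forward(self, x: torch.Tensor, perturb_std: float = 0.0) -> torch.Tensor:
        z_p = F.relu(self.enc_p(x))          # parent feature activations
        if perturb_std > 0:                  # random feature perturbation (illustrative)
            z_p = z_p + perturb_std * torch.randn_like(z_p)
        z_c = F.relu(self.enc_c(x))          # child feature activations

        x_hat_p = self.dec_p(z_p)
        x_hat_c = self.dec_c(z_c)

        assign = F.softmax(self.assign_logits, dim=-1)  # child -> parent assignment
        # Structural constraint (one possible form): a child feature should not
        # activate when its assigned parent feature is inactive.
        parent_support = z_p @ assign.t()               # expected parent activity per child
        struct_loss = F.relu(z_c - parent_support).mean()

        recon_loss = F.mse_loss(x_hat_p, x) + F.mse_loss(x_hat_c, x)
        sparsity = z_p.abs().mean() + z_c.abs().mean()
        return recon_loss + 1e-3 * sparsity + 1e-2 * struct_loss

# Usage: loss = TwoLevelHSAE(d_model=768, n_parent=1024, n_child=8192)(acts, perturb_std=0.05)
```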