Large Language Models (LLMs) encode factual knowledge within hidden parametric spaces that are difficult to inspect or control. While Sparse Autoencoders (SAEs) can decompose hidden activations into more fine-grained, interpretable features, they often struggle to reliably align these features with human-defined concepts, resulting in entangled and distributed feature representations. To address this, we introduce AlignSAE, a method that aligns SAE features with a predefined ontology through a "pre-train, then post-train" curriculum. After an initial unsupervised training phase, we apply supervised post-training to bind specific concepts to dedicated latent slots while preserving the remaining capacity for general reconstruction. This separation creates an interpretable interface where specific concepts can be inspected and controlled without interference from unrelated features. Empirical results demonstrate that AlignSAE enables precise causal interventions, such as reliable "concept swaps", by targeting single, semantically aligned slots, and further supports multi-hop reasoning and a mechanistic probe of grokking-like generalization dynamics.
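To make the two-phase recipe concrete, below is a minimal sketch, not the authors' implementation, of how a "pre-train, then post-train" SAE objective and a slot-level "concept swap" intervention could look. All class names, loss weights, the multi-hot supervision format, and the use of the first `num_concepts` latent dimensions as the dedicated concept slots are illustrative assumptions; only the overall structure (unsupervised reconstruction first, supervised slot binding second, interventions on single aligned slots) follows the abstract.

```python
# Sketch of AlignSAE-style training and intervention (assumed details, PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, h: torch.Tensor):
        pre = self.encoder(h)    # pre-activations (used as logits for supervision)
        z = F.relu(pre)          # sparse, non-negative feature activations
        h_hat = self.decoder(z)  # reconstruction of the hidden activation
        return pre, z, h_hat


def pretrain_loss(z, h_hat, h, l1_coeff: float = 1e-3):
    # Phase 1: standard unsupervised SAE objective (reconstruction + L1 sparsity).
    return F.mse_loss(h_hat, h) + l1_coeff * z.abs().mean()


def posttrain_loss(pre, z, h_hat, h, concept_labels, num_concepts, l1_coeff=1e-3):
    # Phase 2: keep the reconstruction objective, and additionally bind the first
    # `num_concepts` latent slots to ontology concepts. `concept_labels` is an
    # assumed (batch, num_concepts) multi-hot tensor marking which concepts are
    # present; the remaining slots stay free for general reconstruction.
    recon = F.mse_loss(h_hat, h) + l1_coeff * z.abs().mean()
    align = F.binary_cross_entropy_with_logits(
        pre[:, :num_concepts], concept_labels.float()
    )
    return recon + align


def concept_swap(z: torch.Tensor, slot_from: int, slot_to: int) -> torch.Tensor:
    # Causal-intervention sketch: move the activation of one aligned concept slot
    # onto another, leaving all other features untouched, then decode as usual.
    z = z.clone()
    z[:, slot_to] = z[:, slot_from]
    z[:, slot_from] = 0.0
    return z
```

Because each concept occupies a dedicated slot, the swap touches a single latent dimension rather than a distributed direction, which is what makes the intervention precise and free of interference from unrelated features.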