Activation decomposition methods in language models are tightly coupled to geometric assumptions about how concepts are realized in activation space. Existing approaches search for individual global directions, implicitly assuming linear separability, which overlooks concepts with nonlinear or multi-dimensional structure. In this work, we leverage Mixture of Factor Analyzers (MFA) as a scalable, unsupervised alternative that models the activation space as a collection of Gaussian regions, each with its own local covariance structure. MFA decomposes activations into two compositional geometric objects: the region's centroid in activation space, and the local variation from that centroid. We train large-scale MFAs for Llama-3.1-8B and Gemma-2-2B, and show they capture complex, nonlinear structures in activation space. Moreover, evaluations on localization and steering benchmarks show that MFA outperforms unsupervised baselines, is competitive with supervised localization methods, and often achieves stronger steering performance than sparse autoencoders. Together, our findings position local geometry, expressed through subspaces, as a promising unit of analysis for scalable concept discovery and model control, accounting for complex structures that isolated directions fail to capture.
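The centroid-plus-local-variation decomposition described above can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: the component count, dimensions, hard nearest-centroid assignment, and least-squares projection onto each component's factor subspace are all simplifying assumptions made here for clarity.

```python
import numpy as np

# Hypothetical sketch: a "trained" MFA over d-dim activations with K components.
#   means:    (K, d)    region centroids mu_k
#   loadings: (K, d, r) factor loading matrices W_k spanning rank-r local subspaces
rng = np.random.default_rng(0)
K, d, r = 4, 16, 3
means = rng.normal(size=(K, d))
loadings = rng.normal(size=(K, d, r))

def decompose(x, means, loadings):
    """Split an activation x into (region index, centroid, local variation)."""
    # Hard assignment to the nearest centroid (a full MFA would use
    # posterior responsibilities under each Gaussian component instead).
    k = int(np.argmin(np.linalg.norm(means - x, axis=1)))
    W = loadings[k]
    # Least-squares projection of the residual onto span(W_k)
    # gives the latent factors z and the local variation W_k z.
    z = np.linalg.lstsq(W, x - means[k], rcond=None)[0]
    return k, means[k], W @ z

x = rng.normal(size=d)
k, centroid, local = decompose(x, means, loadings)
# centroid + local reconstructs x up to the residual outside the local subspace
```

The two returned geometric objects correspond to the decomposition in the text: `centroid` locates the Gaussian region, while `local` captures the within-region variation along that region's subspace.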