Concept Component Analysis: A Principled Approach for Concept Extraction in LLMs

Developing human-understandable interpretations of large language models (LLMs) is becoming increasingly critical for their deployment in essential domains. Mechanistic interpretability seeks to address this need by extracting human-interpretable processes and concepts from LLMs' activations. Sparse autoencoders (SAEs) have emerged as a popular approach for extracting interpretable and monosemantic concepts by decomposing LLM internal representations into a dictionary. Despite their empirical progress, SAEs suffer from a fundamental theoretical ambiguity: the correspondence between LLM representations and human-interpretable concepts is not well defined. This lack of theoretical grounding gives rise to several methodological challenges, including difficulties in principled method design and in defining evaluation criteria. In this work, we adopt a latent variable model in which concepts are treated as latent variables and show that, under mild assumptions, LLM representations can be approximated as a linear mixture of the log-posteriors over concepts given the input context. This motivates a principled framework for concept extraction, namely Concept Component Analysis (ConCA), which aims to recover the log-posterior of each concept from LLM representations through an unsupervised linear unmixing process. We explore a specific variant, termed sparse ConCA, which leverages a sparsity prior to address the inherent ill-posedness of the unmixing problem. We implement 12 sparse ConCA variants and demonstrate their ability to extract meaningful concepts across multiple LLMs, offering theory-backed advantages over SAEs.
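
The linear-mixture claim and the unmixing step can be written schematically as follows. This is only a sketch in assumed notation: the symbols f, A, b, W, c, and K are illustrative and are not taken from the paper, and the offset terms in particular are an assumption.

```latex
% Illustrative notation only (assumed, not the paper's):
%   x          : input context
%   f(x)       : LLM representation of x, a vector in R^d
%   z_1..z_K   : latent concepts
%   A, b       : unknown mixing matrix (d x K) and offset (assumed)
\[
  f(x) \;\approx\; A\,\ell(x) + b,
  \qquad
  \ell(x) = \bigl[\log p(z_1 \mid x),\ \dots,\ \log p(z_K \mid x)\bigr]^{\top}
\]
% Linear unmixing: estimate the concept log-posteriors from f(x)
% with a learned linear map (W, c), without concept labels.
\[
  \hat{\ell}(x) \;=\; W f(x) + c
\]
```

Because both the mixing matrix and the log-posteriors are unobserved, many different pairs (W, c) are consistent with the same representations. This is the ill-posedness that the sparsity prior in sparse ConCA is meant to address.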
