Large language models (LLMs) encode rich semantic information in their hidden states, yet it remains difficult to understand what information these internal representations capture. Latent concepts extracted from hidden states offer a promising direction for interpreting LLMs, but existing clustering-based methods face a trade-off: hierarchical clustering produces coherent concepts but is limited to small datasets by its quadratic memory cost, while K-Means scales efficiently but may yield less semantically coherent concepts. We propose Vector Quantized Latent Concept (VQLC), a discrete concept learning framework that learns a codebook of latent concepts from frozen hidden states. Across 12 dataset-model settings, VQLC stays close to K-Means in computational cost, scales better than hierarchical clustering, and remains competitive in faithfulness, with the clearest gains on decoder-only models. LLM-based evaluation, qualitative analysis, and a comparison with a Sparse Autoencoder (SAE) indicate that the learned concepts are interpretable and task-relevant.