Interpreting language models remains challenging because the residual stream linearly mixes and duplicates features across adjacent layers, so single-layer analyses miss this cross-layer structure. Cross-layer sparse autoencoders (SAEs) address layer mixing but operate in a continuous space, where concepts split across many neurons without clear boundaries. We introduce the Cross-Layer Vector Quantized-Variational Autoencoder (CLVQ-VAE), a novel framework that maps representations from a lower layer to a higher layer through a discrete vector-quantization bottleneck, collapsing duplicated residual-stream features into compact, interpretable concept vectors. Our approach combines top-k temperature-based sampling with exponential moving average (EMA) codebook updates, providing controlled exploration of the discrete latent space while maintaining codebook diversity. On both encoder- and decoder-based models evaluated on ERASER-Movie, Jigsaw, and AGNews, CLVQ-VAE outperforms clustering, single-layer vector quantized-variational autoencoder (VQ-VAE), and SAE baselines along three evaluation axes: removing the identified concepts reduces model accuracy by up to 93%, LLM judges rank our concepts first in 66.7% of comparisons, and human annotators recover model predictions from our visualizations with 78% accuracy versus 54% for clustering.
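To make the combination of top-k temperature-based sampling and EMA codebook updates concrete, the following is a minimal NumPy sketch of one plausible reading, not the authors' implementation. The abstract does not specify the distance metric, the sampling distribution, or any hyperparameters, so everything here is an assumption: squared Euclidean distance, a softmax over negative distances restricted to the k nearest codes, and the standard EMA update used for VQ-VAE codebooks. The function names and the values of `k`, `temperature`, `decay`, and `eps` are hypothetical.

```python
import numpy as np


def topk_temperature_assign(z, codebook, k=5, temperature=1.0, rng=None):
    """Sample one code index per input from its k nearest codebook entries.

    Assumption: sampling probabilities are a softmax over negative squared
    Euclidean distances, scaled by `temperature`.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Squared distance between every input and every code: shape (n, K).
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    # Indices of the k closest codes per input (unordered within the top k).
    topk = np.argpartition(d2, k - 1, axis=1)[:, :k]
    topk_d2 = np.take_along_axis(d2, topk, axis=1)
    # Numerically stable softmax; lower temperature concentrates on the nearest code.
    logits = -topk_d2 / temperature
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    # Inverse-CDF sampling, one draw per row.
    u = rng.random((z.shape[0], 1))
    choice = (probs.cumsum(axis=1) < u).sum(axis=1)
    choice = np.minimum(choice, k - 1)  # guard against floating-point rounding
    return topk[np.arange(z.shape[0]), choice]


def ema_update(codebook, cluster_size, embed_sum, z, idx, decay=0.99, eps=1e-5):
    """One EMA codebook step (assumed: the standard VQ-VAE EMA formulation)."""
    num_codes = codebook.shape[0]
    onehot = np.eye(num_codes)[idx]   # (n, K) assignment matrix
    counts = onehot.sum(axis=0)       # inputs assigned to each code
    sums = onehot.T @ z               # per-code sum of assigned inputs, (K, d)
    cluster_size = decay * cluster_size + (1 - decay) * counts
    embed_sum = decay * embed_sum + (1 - decay) * sums
    # Laplace smoothing so rarely used codes do not divide by zero.
    total = cluster_size.sum()
    smoothed = (cluster_size + eps) / (total + num_codes * eps) * total
    return embed_sum / smoothed[:, None], cluster_size, embed_sum
```

Under these assumptions, setting `k=1` recovers ordinary nearest-neighbour quantization, while larger `k` or higher `temperature` spreads assignments over more codes. That is one way sampling could help keep the codebook diverse, although the abstract does not confirm this mechanism.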