DB-KSVD: Scalable Alternating Optimization for Disentangling High-Dimensional Embedding Spaces

Dictionary learning has recently emerged as a promising approach for mechanistic interpretability of large transformer models. Disentangling transformer embeddings requires algorithms that scale to high-dimensional data with large sample sizes. Recent work has explored sparse autoencoders (SAEs) for this problem. However, SAEs use a simple linear encoder to solve the sparse encoding subproblem, even though this subproblem is known to be NP-hard. This raises the question of whether such an encoder is sufficient to find good solutions to the dictionary learning problem, or whether a more sophisticated algorithm could find better ones. In this work, we propose Double-Batch KSVD (DB-KSVD), a scalable dictionary learning algorithm that adapts the classic KSVD algorithm. DB-KSVD is informed by the rich theoretical foundations of KSVD but scales to datasets with millions of samples and thousands of dimensions. We demonstrate the efficacy of DB-KSVD by disentangling text embeddings of the Gemma-2-2B and Pythia-160M models and evaluating on six metrics from the SAEBench benchmark, where it achieves results competitive with established SAE-based approaches. We obtain similar results when disentangling image embeddings from the DINOv2-S and DINOv2-B models, which corroborates these findings. Because DB-KSVD matches SAE performance with an entirely different optimization approach, our results suggest that (i) SAEs do find strong solutions to the dictionary learning problem and (ii) traditional optimization approaches can be scaled to the required problem sizes, offering a promising avenue for further research. An implementation of DB-KSVD is available at https://github.com/romeov/ksvd.jl.
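
For orientation, the classic KSVD algorithm that DB-KSVD adapts alternates between a sparse coding step (dictionary fixed) and a dictionary update step that refits each atom with a rank-1 SVD of the residual over the samples that use it. The following is a minimal NumPy sketch of that classic alternation on small synthetic data. It is not the authors' DB-KSVD: it contains none of the double-batching, and the greedy orthogonal matching pursuit used as the sparse coder, the function names `omp` and `ksvd_step`, and all problem sizes are illustrative assumptions rather than details taken from the paper or its Julia implementation.

```python
import numpy as np


def omp(D, y, k):
    """Greedy orthogonal matching pursuit: approximate y with at most k atoms of D."""
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(k):
        # Pick the atom most correlated with the current residual.
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j in support:
            break
        support.append(j)
        # Refit all selected coefficients jointly by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x = np.zeros(D.shape[1])
    x[support] = coef
    return x


def ksvd_step(Y, D, k):
    """One classic KSVD iteration: sparse coding, then per-atom rank-1 updates."""
    D = D.copy()
    # Sparse coding step: encode every sample (column of Y) with at most k atoms.
    X = np.column_stack([omp(D, Y[:, i], k) for i in range(Y.shape[1])])
    # Dictionary update step: refit one atom at a time.
    for j in range(D.shape[1]):
        users = np.flatnonzero(X[j])  # samples whose code uses atom j
        if users.size == 0:
            continue  # unused atom: left unchanged in this sketch
        # Residual on those samples with atom j's own contribution added back.
        E = Y[:, users] - D @ X[:, users] + np.outer(D[:, j], X[j, users])
        U, s, Vt = np.linalg.svd(E, full_matrices=False)
        D[:, j] = U[:, 0]            # new unit-norm atom
        X[j, users] = s[0] * Vt[0]   # matching coefficients, same support
    return D, X


# Synthetic data (illustrative sizes): each sample mixes k atoms of a hidden dictionary.
rng = np.random.default_rng(0)
d, m, n, k = 16, 32, 200, 3
D_true = rng.standard_normal((d, m))
D_true /= np.linalg.norm(D_true, axis=0)
X_true = np.zeros((m, n))
for i in range(n):
    X_true[rng.choice(m, size=k, replace=False), i] = rng.standard_normal(k)
Y = D_true @ X_true

# Random unit-norm initial dictionary, followed by a few KSVD iterations.
D = rng.standard_normal((d, m))
D /= np.linalg.norm(D, axis=0)
errors = []
for _ in range(5):
    D, X = ksvd_step(Y, D, k)
    errors.append(float(np.linalg.norm(Y - D @ X)))
print(errors)
```

The sparse coding step is where this differs from an SAE: an SAE computes the code with a single learned linear map followed by a nonlinearity, whereas the sketch runs an iterative pursuit per sample. Encoding every sample and updating every atom in sequence, as done here, is only practical at toy scale, which is the limitation the abstract says DB-KSVD is designed to overcome.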
