ICA Lens: Interpreting Language Models Without Training Another Dictionary

Finding interpretable directions in language-model representations is critical for understanding and controlling model behavior. Sparse autoencoders (SAEs) have become the standard tool for this purpose, but using them as the default first lens often requires training, storing, and evaluating large overcomplete dictionaries. This bottleneck limits rapid exploration and raises a fundamental question: how much interpretable structure is already visible from activation geometry before training another neural dictionary? Our intuition is simple: many interpretable directions respond selectively to a subset of tokens, so activations projected onto them should look less Gaussian than projections onto random directions. We therefore revisit independent component analysis (ICA), a classical method for finding non-Gaussian directions, as a compact lens for language-model interpretability. We find that ICA has been underestimated for LLM interpretability for two reasons: prior uses often relied on off-the-shelf ICA implementations that are brittle on LLM activations, and they lacked systematic tools for inspecting and evaluating the recovered directions. To bridge these gaps, we introduce ICALens, the first practical workflow for stable, efficient, and auditable ICA analysis of LLM representations. It combines an optimized GPU-parallel FastICA pipeline with LLM-specific stability recipes and improved fitting diagnostics, enabling efficient and reliable layer-wise analysis. Across GPT-2 Small, Gemma 2 2B, and Qwen 3.5 2B Base, ICALens efficiently recovers compact, human-interpretable directions without per-layer gradient-based dictionary training. On SAEBench, ICA is competitive with public SAEs in sparse probing and outperforms them in targeted probe perturbation under small-to-medium budgets. These results suggest that ICA should be viewed not as a weak baseline, but as an efficient and complementary first lens for exploring language-model representations.
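The intuition that token-selective directions look less Gaussian than random ones can be illustrated on synthetic data. The sketch below is not the ICALens pipeline: it is a plain NumPy, CPU-only symmetric FastICA with a tanh contrast, run on made-up "activations" built by rotating sparse sources. Every name in it (`run_demo`, `fastica`, `p_active`, and so on) is hypothetical, and the sparsity level, dimensionality, and orthogonal mixing are assumptions chosen only to keep the example small.

```python
import numpy as np


def excess_kurtosis(x):
    """Excess kurtosis: about 0 for Gaussian data, positive for sparse data."""
    x = x - x.mean()
    return float(np.mean(x**4) / np.mean(x**2) ** 2 - 3.0)


def whiten(X):
    """Center X (n_samples, d) and rescale it to identity covariance."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / Xc.shape[0]
    eigval, eigvec = np.linalg.eigh(cov)
    return Xc @ (eigvec / np.sqrt(eigval))


def sym_decorrelate(W):
    """Symmetric decorrelation: W <- (W W^T)^(-1/2) W."""
    s, u = np.linalg.eigh(W @ W.T)
    return (u / np.sqrt(s)) @ u.T @ W


def fastica(Z, n_iter=200, tol=1e-6, seed=0):
    """Symmetric FastICA (tanh contrast) on whitened data Z of shape (n, d).

    Returns an orthogonal matrix W whose rows are the estimated directions.
    """
    rng = np.random.default_rng(seed)
    n, d = Z.shape
    W = sym_decorrelate(rng.standard_normal((d, d)))
    for _ in range(n_iter):
        Y = Z @ W.T                    # projections onto current directions
        G = np.tanh(Y)                 # g(y)
        G_prime = 1.0 - G**2           # g'(y)
        # Fixed-point update per row: E[z g(w^T z)] - E[g'(w^T z)] w
        W_new = G.T @ Z / n - G_prime.mean(axis=0)[:, None] * W
        W_new = sym_decorrelate(W_new)
        # Converged when each new row is (anti)parallel to the old one.
        delta = np.max(np.abs(np.abs(np.sum(W_new * W, axis=1)) - 1.0))
        W = W_new
        if delta < tol:
            break
    return W


def run_demo(n=20000, d=16, p_active=0.05, seed=0):
    rng = np.random.default_rng(seed)
    # Assumption: each synthetic "feature" is active on ~5% of tokens.
    S = rng.standard_normal((n, d)) * (rng.random((n, d)) < p_active)
    # Assumption: features are hidden by a random orthogonal rotation.
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    Z = whiten(S @ Q.T)

    W = fastica(Z, seed=seed + 1)
    Y = Z @ W.T
    ica_kurt = np.mean([excess_kurtosis(Y[:, i]) for i in range(d)])

    # Baseline: projections onto random unit directions in the same space.
    R = rng.standard_normal((d, d))
    R /= np.linalg.norm(R, axis=1, keepdims=True)
    P = Z @ R.T
    rand_kurt = np.mean([excess_kurtosis(P[:, i]) for i in range(d)])

    # For each true source, best absolute correlation with any ICA component.
    corr = np.abs(np.corrcoef(S.T, Y.T)[:d, d:])
    recovery = corr.max(axis=1).min()
    return float(ica_kurt), float(rand_kurt), float(recovery)


if __name__ == "__main__":
    ica_kurt, rand_kurt, recovery = run_demo()
    print(f"mean excess kurtosis, ICA directions:    {ica_kurt:.2f}")
    print(f"mean excess kurtosis, random directions: {rand_kurt:.2f}")
    print(f"worst-case source recovery (|corr|):     {recovery:.3f}")
```

In this toy setting the ICA directions should show much higher excess kurtosis than random directions, and each sparse source should be matched by some recovered component. Real LLM activations are far less clean, which is presumably where the stability recipes and diagnostics described above come in.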
