Vision-language models learn powerful multimodal embeddings, yet their internal semantics remain opaque. While sparse autoencoders (SAEs) can extract interpretable features, they do so by expanding the representation dimension, which compromises the original geometry and introduces redundancy. We introduce CEDAR (Conceptual Embedding Disentanglement via Adaptive Rotation), a post-hoc method that reveals the compositional structure of pretrained embeddings without increasing dimensionality. By learning an invertible transformation with a top-$k$ sparsity bottleneck, CEDAR concentrates semantic information into axis-aligned, disentangled coordinates. In CLIP-like architectures, individual coordinates can be interpreted through textual concepts, while for generative models such as BLIP, they can be decoded into natural language descriptions. Experiments demonstrate that CEDAR achieves a competitive reconstruction-sparsity trade-off while producing explanations that are more interpretable and better aligned with human perception. Our results suggest that the apparent entanglement in vision-language representations can be resolved through a suitable change of basis, eliminating the need for overcomplete expansions.
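As a rough illustration of the idea of a dimension-preserving invertible transformation followed by a top-$k$ bottleneck, the following minimal NumPy sketch may help. It rests on several assumptions that the abstract does not confirm: the transformation is taken to be an orthogonal rotation (so its inverse is its transpose), a random rotation stands in for the learned one, and top-$k$ selection is done by coordinate magnitude. The function names `random_rotation`, `top_k`, and `encode_decode` are hypothetical and are not part of CEDAR.

```python
import numpy as np


def random_rotation(dim, seed=0):
    """Random orthogonal matrix; a stand-in for a learned transformation (assumption)."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    # Flip column signs using the diagonal of R; the result is still orthogonal.
    return q * np.sign(np.diag(r))


def top_k(z, k):
    """Keep the k largest-magnitude coordinates of each row and zero out the rest."""
    idx = np.argpartition(np.abs(z), -k, axis=-1)[..., -k:]
    out = np.zeros_like(z)
    np.put_along_axis(out, idx, np.take_along_axis(z, idx, axis=-1), axis=-1)
    return out


def encode_decode(x, rotation, k):
    """Rotate embeddings, sparsify them, then map back with the inverse rotation."""
    z = x @ rotation.T            # change of basis, same dimensionality as x
    z_sparse = top_k(z, k)        # sparsity bottleneck (assumed magnitude-based)
    x_hat = z_sparse @ rotation   # inverse of an orthogonal map is its transpose
    return z_sparse, x_hat


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.standard_normal((4, 8))   # 4 toy "embeddings" of dimension 8
    rot = random_rotation(8)
    z_sparse, x_hat = encode_decode(x, rot, k=3)
    print("active coordinates per row:", np.count_nonzero(z_sparse, axis=1))
    print("reconstruction error:", np.linalg.norm(x - x_hat))
```

Because the rotation in this sketch is orthogonal, the squared reconstruction error equals the energy of the discarded coordinates, so a basis that concentrates information in a few axes directly improves the reconstruction-sparsity trade-off. With `k` equal to the embedding dimension, the reconstruction is exact up to floating-point error.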