Despite their capabilities, Large Language Models (LLMs) remain opaque, and our understanding of their internal representations is limited. Current interpretability methods either focus on input-oriented feature extraction, such as supervised probes and Sparse Autoencoders (SAEs), or on output distribution inspection, such as logit-oriented approaches. A full understanding of LLM vector spaces, however, requires integrating both perspectives, something existing approaches struggle with due to constraints on how latent features are defined. We introduce the Hyperdimensional Probe, a hybrid supervised probe that combines symbolic representations with neural probing. Leveraging Vector Symbolic Architectures (VSAs) and hypervector algebra, it unifies prior methods: the top-down interpretability of supervised probes, the sparsity-driven proxy space of SAEs, and output-oriented logit investigation. This allows deeper input-focused feature extraction while supporting output-oriented investigation. Our experiments show that the probe consistently extracts meaningful concepts across LLMs, embedding sizes, and setups, uncovering concept-driven patterns in analogy-oriented inference and QA-focused text generation. By supporting joint input-output analysis, this work advances the semantic understanding of neural representations while unifying the complementary perspectives of prior methods.
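As a rough illustration of the hypervector algebra the probe builds on, the sketch below shows generic VSA binding and bundling with random bipolar hypervectors in a multiply-add-permute style; the dimensionality, concept codebook, and specific VSA variant are assumptions for this toy example, not the paper's actual design.

```python
# Illustrative sketch of basic hypervector algebra (generic VSA toy example,
# not the Hyperdimensional Probe itself). Assumptions: bipolar hypervectors,
# element-wise-multiplication binding, majority-vote bundling.
import numpy as np

rng = np.random.default_rng(0)
D = 10_000  # hypervector dimensionality (assumed)

def random_hv():
    """Random bipolar hypervector in {-1, +1}^D."""
    return rng.choice([-1, 1], size=D)

def bind(a, b):
    """Binding: element-wise multiplication (self-inverse for bipolar HVs)."""
    return a * b

def bundle(*hvs):
    """Bundling: element-wise majority vote (sign of the sum)."""
    return np.sign(np.sum(hvs, axis=0)).astype(int)

def cosine(a, b):
    """Similarity between hypervectors."""
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# A tiny symbolic codebook of concept hypervectors (hypothetical concepts).
codebook = {name: random_hv() for name in ["country", "France", "capital", "Paris"]}

# Encode a structure as a bundle of role-filler bindings.
record = bundle(bind(codebook["country"], codebook["France"]),
                bind(codebook["capital"], codebook["Paris"]))

# Query: unbind the "capital" role, then clean up against the codebook.
query = bind(record, codebook["capital"])  # ~ Paris + noise
best = max(codebook, key=lambda name: cosine(query, codebook[name]))
print(best)  # expected: "Paris"
```

The key property used here is that binding is (approximately) invertible and bundling preserves similarity to its constituents, so symbolic role-filler structures can be stored in, and recovered from, a single high-dimensional vector.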