Polysemanticity is pervasive in language models and remains a major challenge for interpretation and behavioral control. Using sparse autoencoders (SAEs), we map the polysemantic topology of two small models (Pythia-70M and GPT-2-Small) and identify SAE feature pairs that are semantically unrelated yet interfere with each other inside the models. We intervene at four foci (prompt, token, feature, neuron) and measure the induced shifts in the next-token prediction distribution, uncovering polysemantic structures that expose a systematic vulnerability in these models. Critically, interventions distilled from counterintuitive interference patterns shared by the two small models transfer reliably to larger instruction-tuned models (Llama-3.1-8B/70B-Instruct and Gemma-2-9B-Instruct), yielding predictable behavioral shifts without access to model internals. These findings challenge the view that polysemanticity is purely stochastic and show instead that interference structures generalize across scale and model family. Such generalization suggests a convergent, higher-order organization of internal representations that is only weakly aligned with intuition and is structured by latent regularities, offering new possibilities for both black-box control and theoretical insight into human and artificial cognition.
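To make the phrase "induced shifts in the next-token prediction distribution" concrete, the following is a minimal sketch of one way such a shift could be quantified. The abstract does not name a metric, so the use of KL divergence here is an assumption, and the function names (`softmax`, `kl_divergence`, `next_token_shift`) and the toy logits are hypothetical, chosen only for illustration.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against log(0) and division by zero."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def next_token_shift(baseline_logits, intervened_logits):
    """Shift in the next-token distribution caused by an intervention.

    Assumption: the shift is measured as KL(baseline || intervened);
    the source text does not specify which divergence is used.
    """
    p = softmax(baseline_logits)
    q = softmax(intervened_logits)
    return kl_divergence(p, q)


# Toy logits over a four-token vocabulary (illustrative values only).
baseline = [2.0, 1.0, 0.1, -1.0]
intervened = [1.5, 1.6, 0.1, -1.0]

print(next_token_shift(baseline, baseline))    # no intervention: zero shift
print(next_token_shift(baseline, intervened))  # intervention: positive shift
```

In practice the two logit vectors would come from the same model run with and without an intervention at one of the four foci (prompt, token, feature, or neuron). For the black-box setting described above, only the output distributions are needed, not the model internals.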