Language and vision-language models have shown impressive performance across a wide range of tasks, but their internal mechanisms remain only partly understood. In this work, we study how individual attention heads in text-generative models specialize in specific semantic or visual attributes. Building on an established interpretability method, we reinterpret the practice of probing intermediate activations with the final decoding layer through the lens of signal processing. This lets us analyze multiple samples in a principled way and rank attention heads based on their relevance to target concepts. Our results show consistent patterns of specialization at the head level across both unimodal and multimodal transformers. Remarkably, we find that editing as few as 1% of the heads, selected using our method, can reliably suppress or enhance targeted concepts in the model output. We validate our approach on language tasks such as question answering and toxicity mitigation, as well as vision-language tasks including image classification and captioning. Our findings highlight an interpretable and controllable structure within attention layers, offering simple tools for understanding and editing large-scale generative models.
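To make the head-ranking idea concrete, below is a minimal, illustrative sketch of scoring attention heads by decoding their per-head contributions with the final unembedding layer and averaging over samples. All names (`score_heads`, `head_outputs`, `W_U`, `concept_id`) are hypothetical placeholders; this is not the paper's actual signal-processing formulation, only a plausible instance of the probing-and-ranking pattern it describes.

```python
# Hypothetical sketch: rank heads by the concept logit their output induces
# when decoded with the final (unembedding) layer, averaged over samples.
import torch

def score_heads(head_outputs: torch.Tensor, W_U: torch.Tensor, concept_id: int) -> torch.Tensor:
    """Score each attention head's relevance to a target concept.

    head_outputs: (n_samples, n_layers, n_heads, d_model) per-head residual
                  contributions at the final token position (assumed given).
    W_U:          (d_model, vocab_size) final decoding / unembedding matrix.
    concept_id:   vocabulary index of the target concept token.
    Returns:      (n_layers, n_heads) mean concept logit per head.
    """
    # Decode each head's contribution through the final layer ("logit lens").
    logits = head_outputs @ W_U                  # (S, L, H, V)
    concept_logit = logits[..., concept_id]      # (S, L, H)
    # Averaging over samples treats per-sample logits as noisy measurements
    # of a per-head signal, in the spirit of the signal-processing view.
    return concept_logit.mean(dim=0)             # (L, H)

if __name__ == "__main__":
    torch.manual_seed(0)
    S, L, H, D, V = 16, 4, 8, 32, 100            # toy sizes for demonstration
    scores = score_heads(torch.randn(S, L, H, D), torch.randn(D, V), concept_id=7)
    # Select roughly the top 1% of heads, mirroring the editing budget above.
    k = max(1, (L * H) // 100)
    top = torch.topk(scores.flatten(), k=k)
    print("top head indices (layer * n_heads + head):", top.indices.tolist())
```

In an actual model, the selected heads would then be ablated or amplified to suppress or enhance the target concept; the toy tensors here merely stand in for activations extracted from a real transformer.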