Language models demonstrate remarkable capacity to generalize representations learned in one modality to downstream tasks in other modalities. Can we trace this ability to individual neurons? We study the case where a frozen text transformer is augmented with vision using a self-supervised visual encoder and a single linear projection learned on an image-to-text task. Outputs of the projection layer are not immediately decodable into language describing image content; instead, we find that translation between modalities occurs deeper within the transformer. We introduce a procedure for identifying "multimodal neurons" that convert visual representations into corresponding text, and decoding the concepts they inject into the model's residual stream. In a series of experiments, we show that multimodal neurons operate on specific visual concepts across inputs, and have a systematic causal effect on image captioning.
翻译:语言模型展现出将从单一模态学到的表示泛化到其他模态下游任务的非凡能力。我们能否将这种能力追溯到单个神经元?我们研究了这样一种情况:将冻结的文本Transformer与自监督视觉编码器及一个在图像到文本任务上学到的线性投影层相结合来增强视觉能力。投影层的输出并非能立即解码成描述图像内容的语言;相反,我们发现模态间的转换发生在Transformer更深的层中。我们提出了一种识别“多模态神经元”的程序,这些神经元能将视觉表示转化为相应的文本,并解码它们注入模型残差流中的概念。在一系列实验中,我们表明多模态神经元在多种输入上对特定视觉概念进行操作,并对图像描述产生系统性的因果影响。