Language models demonstrate remarkable capacity to generalize representations learned in one modality to downstream tasks in other modalities. Can we trace this ability to individual neurons? We study the case where a frozen text transformer is augmented with vision using a self-supervised visual encoder and a single linear projection learned on an image-to-text task. Outputs of the projection layer are not immediately decodable into language describing image content; instead, we find that translation between modalities occurs deeper within the transformer. We introduce a procedure for identifying "multimodal neurons" that convert visual representations into corresponding text, and decoding the concepts they inject into the model's residual stream. In a series of experiments, we show that multimodal neurons operate on specific visual concepts across inputs, and have a systematic causal effect on image captioning.
翻译:语言模型展现了将一种模态学到的表示泛化到其他模态下游任务的惊人能力。我们能否将这种能力追溯到单个神经元?我们研究了一种场景:通过一个自监督视觉编码器和在图像到文本任务上学得的单一线性投影,为冻结的文本变换器增加视觉能力。投影层的输出并不能立即解码为描述图像内容的语言;相反,我们发现模态间的转换发生在变换器更深层。我们引入了一种程序,用于识别将视觉表示转换为对应文本的"多模态神经元",并解码它们注入模型残差流中的概念。在一系列实验中,我们展示了多模态神经元跨输入处理特定视觉概念,并对图像描述产生系统性的因果影响。