Frozen Transformers in Language Models Are Effective Visual Encoder Layers

This paper reveals that large language models (LLMs), despite being trained solely on textual data, are surprisingly strong encoders for purely visual tasks in the absence of language. Even more intriguingly, this can be achieved by a simple yet previously overlooked strategy -- employing a frozen transformer block from a pre-trained LLM as a constituent encoder layer to directly process visual tokens. Our work pushes the boundaries of leveraging LLMs for computer vision tasks, departing significantly from conventional practices that typically require a multi-modal vision-language setup with associated language prompts, inputs, or outputs. We demonstrate that our approach consistently enhances performance across a diverse range of tasks, encompassing pure 2D and 3D visual recognition tasks (e.g., image and point cloud classification), temporal modeling tasks (e.g., action recognition), non-semantic tasks (e.g., motion forecasting), and multi-modal tasks (e.g., 2D/3D visual question answering and image-text retrieval). Such improvements are a general phenomenon, applicable to various types of LLMs (e.g., LLaMA and OPT) and to different LLM transformer blocks. We additionally propose the information filtering hypothesis to explain the effectiveness of pre-trained LLMs in visual encoding -- the pre-trained LLM transformer blocks discern informative visual tokens and further amplify their effect. This hypothesis is empirically supported by the observation that the feature activation, after training with LLM transformer blocks, exhibits a stronger focus on relevant regions. We hope that our work inspires new perspectives on utilizing LLMs and deepens our understanding of their underlying mechanisms. Code is available at https://github.com/ziqipang/LM4VisualEncoding.
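
The strategy described above can be illustrated with a minimal PyTorch sketch. It is not the released implementation: the class name `FrozenBlockEncoder`, the toy dimensions, and the two linear projections that match the visual and LLM feature widths are assumptions made for illustration, and a randomly initialized `nn.TransformerEncoderLayer` stands in for a real pre-trained LLM block (e.g., one taken from LLaMA or OPT), whose weights would be loaded in practice.

```python
import torch
import torch.nn as nn


class FrozenBlockEncoder(nn.Module):
    """Toy classifier: visual layer -> frozen 'LLM' block -> head (hypothetical names)."""

    def __init__(self, patch_dim=48, vis_dim=64, llm_dim=128, num_classes=10):
        super().__init__()
        # Trainable visual encoder: patch embedding plus one transformer layer.
        self.patch_embed = nn.Linear(patch_dim, vis_dim)
        self.visual_layer = nn.TransformerEncoderLayer(
            vis_dim, nhead=4, dim_feedforward=128, dropout=0.0, batch_first=True
        )
        # Assumption: trainable linear layers map between visual and LLM widths.
        self.proj_in = nn.Linear(vis_dim, llm_dim)
        self.proj_out = nn.Linear(llm_dim, vis_dim)
        # Stand-in for a pre-trained LLM transformer block (random weights here).
        self.llm_block = nn.TransformerEncoderLayer(
            llm_dim, nhead=4, dim_feedforward=256, dropout=0.0, batch_first=True
        )
        # Freeze the block: its parameters receive no gradient updates.
        for param in self.llm_block.parameters():
            param.requires_grad = False
        self.head = nn.Linear(vis_dim, num_classes)

    def forward(self, patches):
        # patches: (batch, num_tokens, patch_dim) visual tokens, no text involved.
        tokens = self.visual_layer(self.patch_embed(patches))
        tokens = self.proj_out(self.llm_block(self.proj_in(tokens)))
        return self.head(tokens.mean(dim=1))  # mean-pool tokens, then classify


model = FrozenBlockEncoder()
# Only the non-frozen parameters are handed to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-3)

logits = model(torch.randn(2, 16, 48))  # 2 images, 16 patch tokens each
logits.sum().backward()                 # gradients still flow through the frozen block
optimizer.step()
```

Note that freezing only stops the block's own weights from updating; gradients still pass through it, so the surrounding visual layers and projections are trained to produce tokens that the frozen block can process.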
