This work examines whether decoder-only Transformers such as LLaMA, originally designed for large language models (LLMs), can be adapted to computer vision. We first "LLaMAfy" a standard ViT step-by-step to align with LLaMA's architecture, and find that directly applying a causal mask to the self-attention causes an attention collapse issue, resulting in the failure of network training. We propose repositioning the class token behind the image tokens with a post-sequence class token technique to overcome this challenge, enabling causal self-attention to efficiently capture the entire image's information. Additionally, we develop a soft mask strategy that gradually introduces a causal mask into the self-attention at the onset of training to facilitate optimization. The tailored model, dubbed image LLaMA (iLLaMA), is akin to LLaMA in architecture and enables direct supervised learning. Its causal self-attention boosts computational efficiency and learns complex representations by elevating attention map ranks. iLLaMA rivals the performance of its encoder-only counterparts, achieving 75.1% ImageNet top-1 accuracy with only 5.7M parameters. Scaling the model to $\sim$310M parameters and pre-training on ImageNet-21K further enhance the accuracy to 86.0%. Extensive experiments demonstrate iLLaMA's reliable properties: shape-texture bias, calibration, quantization compatibility, ADE20K segmentation, and CIFAR transfer learning. We hope our study can kindle fresh views on visual architectures in the wave of LLMs and inspire the development of unified multimodal models. Pre-trained models and code are available at https://github.com/techmonsterwang/iLLaMA.
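The soft mask strategy above can be pictured as an additive attention bias that is annealed from no masking toward a (nearly) hard causal mask over the early phase of training. The sketch below is a minimal NumPy illustration of that idea; the linear interpolation factor `alpha`, the bias magnitude `neg`, and all function names are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def soft_causal_mask(seq_len, alpha, neg=1e4):
    """Additive attention bias interpolating from no mask (alpha=0)
    to a (nearly) hard causal mask (alpha=1). Illustrative sketch."""
    future = np.triu(np.ones((seq_len, seq_len)), k=1)  # 1 above the diagonal: "future" positions
    return -alpha * neg * future  # large negative bias suppresses attention to the future

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy attention scores for a 4-token sequence (all zeros for clarity).
scores = np.zeros((4, 4))

# At the onset of training (alpha=0): fully bidirectional attention.
early = softmax(scores + soft_causal_mask(4, alpha=0.0))

# After the schedule finishes (alpha=1): effectively causal attention;
# token 0 attends only to itself.
late = softmax(scores + soft_causal_mask(4, alpha=1.0))
```

Annealing `alpha` from 0 to 1 (e.g., linearly over the first few epochs) lets the network start in the easier bidirectional regime before the causal constraint is fully imposed, which is the optimization behavior the abstract describes.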