This work examines whether decoder-only Transformers such as LLaMA, which were originally designed for large language models (LLMs), can be adapted to computer vision. We first "LLaMAfy" a standard ViT step by step to align it with LLaMA's architecture, and find that directly applying a causal mask to the self-attention causes an attention collapse issue that makes network training fail. To overcome this challenge, we propose a post-sequence class token technique that repositions the class token behind the image tokens, enabling causal self-attention to efficiently capture the entire image's information. Additionally, we develop a soft mask strategy that gradually introduces the causal mask to the self-attention at the onset of training to ease optimization. The tailored model, dubbed image LLaMA (iLLaMA), is akin to LLaMA in architecture and enables direct supervised learning. Its causal self-attention boosts computational efficiency and learns more complex representations by raising the rank of the attention maps. iLLaMA rivals the performance of its encoder-only counterparts, achieving 75.1% ImageNet top-1 accuracy with only 5.7M parameters. Scaling the model to ~310M parameters and pre-training on ImageNet-21K further raises the accuracy to 86.0%. Extensive experiments demonstrate that iLLaMA behaves reliably with respect to calibration, shape-texture bias, quantization compatibility, ADE20K segmentation, and CIFAR transfer learning. We hope our study can spark fresh perspectives on visual model design in the wave of LLMs. Pre-trained models and code are publicly available.
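The two techniques named in the abstract, the post-sequence class token and the soft mask, can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function names (`causal_mask`, `visible_to`, `soft_mask`) are hypothetical, and the linear blend from an all-ones mask to a causal mask is an assumption, since the abstract only states that the causal mask is introduced gradually at the onset of training.

```python
def causal_mask(n):
    """Lower-triangular 0/1 mask: token i may attend to tokens 0..i only."""
    return [[1.0 if j <= i else 0.0 for j in range(n)] for i in range(n)]


def visible_to(mask, query_index):
    """Indices of the tokens that the token at query_index may attend to."""
    return [j for j, weight in enumerate(mask[query_index]) if weight > 0.0]


def soft_mask(n, progress):
    """Blend from full (bidirectional) attention to a causal mask.

    progress = 0.0 gives an all-ones mask, progress = 1.0 gives the causal
    mask. The linear schedule is an assumption made for illustration.
    """
    p = min(max(progress, 0.0), 1.0)  # clamp to [0, 1]
    return [[1.0 if j <= i else 1.0 - p for j in range(n)] for i in range(n)]


num_patches = 4
n = num_patches + 1  # image tokens plus one class token
mask = causal_mask(n)

# Class token placed first (standard ViT layout): it sees only itself.
cls_first_view = visible_to(mask, 0)

# Class token placed last (post-sequence layout): it sees every token.
cls_last_view = visible_to(mask, n - 1)

print(cls_first_view)
print(cls_last_view)
print(soft_mask(3, 0.5))
```

Under a causal mask, a class token at the front of the sequence cannot attend to any image token, whereas one at the end can attend to all of them, which is consistent with the motivation the abstract gives for repositioning it.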