The recently proposed SparseFormer architecture offers an alternative approach to visual understanding: by adjusting token regions of interest (RoIs), it uses far fewer visual tokens, greatly reducing computational cost while still achieving promising performance. However, training SparseFormers from scratch remains expensive, and scaling up the number of parameters can be challenging. In this paper, we propose to bootstrap SparseFormers from ViT-based vision foundation models in a simple and efficient way. Since the majority of SparseFormer blocks are standard transformer blocks, we can inherit weights from large-scale pre-trained vision transformers and freeze as many of them as possible. We therefore only need to train the SparseFormer-specific lightweight focusing transformer, which adjusts token RoIs, and fine-tune a few early pre-trained blocks to align the final token representation. In this way, we can bootstrap SparseFormer architectures from various large-scale pre-trained models (e.g., IN-21K pre-trained AugRegs or CLIPs) within just a few hours, using a relatively small number of training samples (e.g., IN-1K) and no labels or captions. As a result, the bootstrapped unimodal SparseFormer (from AugReg-ViT-L/16-384) reaches 84.9% accuracy on IN-1K with only 49 tokens, and the multimodal SparseFormer bootstrapped from CLIPs demonstrates notable zero-shot performance at greatly reduced computational cost, without seeing any caption during the bootstrapping procedure. In addition, CLIP-bootstrapped SparseFormers, which align their output space with language without seeing a word, can serve as efficient vision encoders in multimodal large language models. Code will be publicly available at https://github.com/showlab/sparseformer
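To make the token budget and the freezing recipe concrete, the following is a minimal illustrative sketch and not the released code. It computes how many patch tokens a plain ViT-L/16 produces at 384×384 resolution for comparison with the 49 tokens quoted above, and it partitions a hypothetical list of modules into trainable and frozen groups in the way the abstract describes. The module names (`focusing_transformer`, `block_i`) and the choice of four fine-tuned early blocks are assumptions made for illustration only; the abstract says only "a few". The count of 24 blocks is the standard ViT-L depth.

```python
def patch_token_count(image_size: int, patch_size: int) -> int:
    """Patch tokens a plain ViT produces for a square image (class token excluded)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


def bootstrap_plan(num_pretrained_blocks: int, num_finetuned_early_blocks: int) -> dict:
    """Split modules into trainable and frozen groups, following the abstract's recipe.

    Module names are hypothetical labels, not identifiers from the authors' code.
    """
    if not 0 <= num_finetuned_early_blocks <= num_pretrained_blocks:
        raise ValueError("num_finetuned_early_blocks must lie in [0, num_pretrained_blocks]")
    # The SparseFormer-specific focusing transformer has no ViT counterpart, so it is trained.
    trainable = ["focusing_transformer"]
    # A few early inherited blocks are fine-tuned to align the final token representation.
    trainable += [f"block_{i}" for i in range(num_finetuned_early_blocks)]
    # All remaining inherited blocks keep their pre-trained weights.
    frozen = [f"block_{i}" for i in range(num_finetuned_early_blocks, num_pretrained_blocks)]
    return {"trainable": trainable, "frozen": frozen}


vit_tokens = patch_token_count(384, 16)    # ViT-L/16 at 384x384 resolution
sparse_tokens = 49                         # token count reported in the abstract
token_ratio = vit_tokens / sparse_tokens   # how many times fewer tokens are processed

# 24 blocks is the standard ViT-L depth; 4 fine-tuned blocks is an assumed value.
plan = bootstrap_plan(num_pretrained_blocks=24, num_finetuned_early_blocks=4)

print(f"ViT tokens: {vit_tokens}, sparse tokens: {sparse_tokens}, ratio: {token_ratio:.1f}x")
print(f"trainable: {plan['trainable']}")
print(f"frozen blocks: {len(plan['frozen'])}")
```

Under these assumptions the plain ViT processes 576 patch tokens, roughly 12 times as many as the 49-token SparseFormer, and only the focusing transformer plus the first four inherited blocks would receive gradient updates.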