Vision Transformers (ViTs) are general and accurate across many tasks, but they are slow and not always practical when efficiency is key. Existing methods for faster ViTs either design hybrid non-ViT architectures, losing generality, or shrink their tokens, sacrificing accuracy. Many non-ViT architectures are both fast and accurate. Yet, without significant modifications, they cannot do what ViTs can: process other input shapes, pre-train with state-of-the-art self-supervised learning, reduce computation by dropping tokens, and more. We make ViTs faster by reducing patch token width while increasing global token width via a new Jumbo token. Our wider Jumbo token is processed by its own wider FFN to increase model capacity. Yet our Jumbo FFN is efficient: it processes a single token, for speed, and its parameters are shared across all layers, for memory. Crucially, our Jumbo is attention-only and non-hierarchical, like a plain ViT, so it is simple, scalable, flexible, and compatible with ViT methods new and old. Jumbo improves over ViT-with-Registers baselines from Nano to Large scales on ImageNet-1K (by 0.1-13%) while maintaining speed/throughput. Jumbo also improves segmentation (1.9-3.1% on ADE20K), MAE pre-training (4.9% linear probing on ImageNet-1K), test-time adaptation (5.2% on ImageNet-C), and time series modeling. Our Jumbo models even achieve better speed-accuracy trade-offs than specialized non-ViT compute-efficient models, while maintaining plain-ViT compatibility for practicality. Code and weights are available: https://github.com/antofuller/jumbo
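The core token-routing idea above can be illustrated with a minimal sketch: narrow patch tokens pass through a per-layer FFN, while the single wide Jumbo token passes through one wide FFN whose parameters are reused at every layer. This is a simplified illustration under stated assumptions, not the paper's implementation: attention (which mixes patch and Jumbo tokens) is omitted, a tanh stands in for the usual GELU, and all names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def ffn(x, w1, w2):
    # Two-layer feed-forward block (tanh used in place of GELU for simplicity).
    return np.tanh(x @ w1) @ w2

# Hypothetical sizes: narrow patch width d, wide Jumbo width, token/layer counts.
d, wide, n_patches, n_layers = 64, 256, 16, 4

# Per-layer FFN parameters for the narrow patch tokens.
patch_ffns = [(rng.normal(size=(d, 4 * d)) * 0.02,
               rng.normal(size=(4 * d, d)) * 0.02) for _ in range(n_layers)]

# A single wide FFN for the Jumbo token, shared across all layers (saves memory).
jumbo_ffn = (rng.normal(size=(wide, 4 * wide)) * 0.02,
             rng.normal(size=(4 * wide, wide)) * 0.02)

patches = rng.normal(size=(n_patches, d))  # narrow patch tokens
jumbo = rng.normal(size=(wide,))           # one wide global token

for layer in range(n_layers):
    w1, w2 = patch_ffns[layer]
    patches = patches + ffn(patches, w1, w2)  # per-layer patch FFN (narrow, fast)
    jumbo = jumbo + ffn(jumbo, *jumbo_ffn)    # shared wide FFN on a single token

print(patches.shape, jumbo.shape)  # → (16, 64) (256,)
```

Because the wide FFN sees only one token per forward pass, its extra compute is small relative to the patch tokens, and sharing its parameters across layers keeps the memory cost of the extra width bounded.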