The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al., 2022). We present a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between fairness and performance, state-of-the-art alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT-22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.