Scaling laws have recently been employed to derive the compute-optimal model size (number of parameters) for a given compute duration. We advance and refine such methods to infer compute-optimal model shapes, such as width and depth, and successfully apply them to vision transformers. Our shape-optimized vision transformer, SoViT, achieves results competitive with models more than twice its size, despite being pre-trained with an equivalent amount of compute. For example, SoViT-400m/14 achieves 90.3% fine-tuning accuracy on ILSVRC2012, surpassing the much larger ViT-g/14 and approaching ViT-G/14 under identical settings, at less than half the inference cost. We conduct a thorough evaluation across multiple tasks, such as image classification, captioning, VQA and zero-shot transfer, demonstrating the effectiveness of our model across a broad range of domains and identifying its limitations. Overall, our findings challenge the prevailing approach of blindly scaling up vision models and pave the way for more informed scaling.