Scaling laws have recently been employed to derive the compute-optimal model size (number of parameters) for a given compute duration. We advance and refine such methods to infer compute-optimal model shapes, such as width and depth, and successfully implement this in vision transformers. Our shape-optimized vision transformer, SoViT, achieves results competitive with models more than twice its size, despite being pre-trained with an equivalent amount of compute. For example, SoViT-400m/14 achieves 90.3% fine-tuning accuracy on ILSVRC2012, surpassing the much larger ViT-g/14 and approaching ViT-G/14 under identical settings, at less than half the inference cost. We conduct a thorough evaluation across multiple tasks, such as image classification, captioning, VQA, and zero-shot transfer, demonstrating the effectiveness of our model across a broad range of domains and identifying its limitations. Overall, our findings challenge the prevailing approach of blindly scaling up vision models and pave a path toward more informed scaling.