Large-scale visual pretraining has significantly improved the performance of large vision models. However, we observe a \emph{low FLOPs pitfall}: existing low-FLOPs models cannot benefit from large-scale pretraining. In this paper, we propose ParameterNet, a general design principle for large-scale visual pretraining that adds more parameters while keeping FLOPs low. For instance, dynamic convolutions are used to equip the networks with more parameters while only slightly increasing the FLOPs. The proposed ParameterNet scheme enables low-FLOPs networks to benefit from large-scale visual pretraining. Experiments on the large-scale ImageNet-22K dataset demonstrate the superiority of our ParameterNet scheme. For example, ParameterNet-600M achieves higher accuracy than the widely used Swin Transformer (81.6\% \emph{vs.} 80.9\%) with much lower FLOPs (0.6G \emph{vs.} 4.5G). The code will be released soon (MindSpore: https://gitee.com/mindspore/models, PyTorch: https://github.com/huawei-noah/Efficient-AI-Backbones).
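To make the "more parameters, nearly the same FLOPs" claim concrete, the following is a minimal back-of-the-envelope sketch, not the released implementation. It assumes a common form of dynamic convolution: global average pooling followed by a two-layer MLP and a softmax produces one mixing coefficient per expert kernel, the expert kernels are summed into a single kernel, and that kernel is applied as an ordinary convolution (stride 1, same padding, bias ignored). The function names, the `hidden` width, and the layer sizes in the example are hypothetical, and multiply-accumulate operations (MACs) are used as a proxy for FLOPs.

```python
def conv_cost(c_in, c_out, k, h, w):
    """Return (parameters, MACs) of a plain k x k convolution on an h x w map."""
    params = c_out * c_in * k * k
    macs = h * w * params  # one kernel application per output position
    return params, macs


def dynamic_conv_cost(c_in, c_out, k, h, w, num_experts, hidden):
    """Return (parameters, MACs) of a dynamic convolution (assumed structure).

    Assumption: coefficients come from global average pooling plus a
    two-layer MLP (c_in -> hidden -> num_experts); experts are mixed into
    one kernel before the convolution is applied.
    """
    expert_params, conv_macs = conv_cost(c_in, c_out, k, h, w)
    router_params = c_in * hidden + hidden * num_experts

    params = num_experts * expert_params + router_params

    pool_macs = h * w * c_in                  # global average pooling
    router_macs = router_params               # two small matrix-vector products
    mix_macs = num_experts * expert_params    # weighted sum of expert kernels
    macs = pool_macs + router_macs + mix_macs + conv_macs
    return params, macs


if __name__ == "__main__":
    # Hypothetical layer: 128 -> 128 channels, 3x3 kernel, 14x14 feature map.
    base_p, base_m = conv_cost(128, 128, 3, 14, 14)
    dyn_p, dyn_m = dynamic_conv_cost(128, 128, 3, 14, 14, num_experts=4, hidden=32)
    print(f"parameter ratio: {dyn_p / base_p:.3f}")
    print(f"MAC ratio:       {dyn_m / base_m:.3f}")
```

Under these assumptions, the kernel-mixing cost does not depend on the spatial resolution, so the relative overhead is roughly `num_experts / (h * w)`. This is why the parameter count grows about `num_experts`-fold while the compute stays close to that of the plain convolution.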