Scaling up neural networks has been a key recipe for the success of large language and vision models. However, in practice, scaled-up models can be disproportionately costly in computation while providing only marginal improvements in performance; for example, EfficientViT-L3-384 achieves less than 2% improvement in ImageNet-1K accuracy over the base L1-224 model, while requiring $14\times$ more multiply-accumulate operations (MACs). In this paper, we investigate the scaling properties of popular families of neural networks for image classification, and find that scaled-up models mostly help with "difficult" samples. Decomposing the samples by difficulty, we develop a simple, model-agnostic two-pass Little-Big algorithm that first uses a lightweight "little" model to make predictions on all samples, and passes only the difficult ones to the "big" model to solve. A good little companion achieves drastic MACs reductions across a wide variety of model families and scales. Without loss of accuracy or modification of existing models, our Little-Big models achieve MACs reductions of 76% for EfficientViT-L3-384, 81% for EfficientNet-B7-600, and 71% for DeiT3-L-384 on ImageNet-1K. Little-Big also speeds up the InternImage-G-512 model by 62% while achieving 90% ImageNet-1K top-1 accuracy, serving both as a strong baseline and as a simple, practical method for large model compression.
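The two-pass rule described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `little` and `big` are assumed to be callables returning class logits, and the confidence threshold `tau` on the little model's softmax probability is a hypothetical criterion for "difficult" samples (the abstract does not specify how difficulty is measured).

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def little_big_predict(little, big, x, tau=0.9):
    """Two-pass Little-Big inference sketch: trust the little model on
    confident ("easy") samples; re-run only the rest through the big model."""
    probs = softmax(little(x))            # first pass: little model on ALL samples
    conf = probs.max(axis=-1)             # little model's confidence per sample
    preds = probs.argmax(axis=-1)
    hard = conf < tau                     # assumed difficulty criterion
    if hard.any():                        # second pass: big model on hard samples only
        preds[hard] = softmax(big(x[hard])).argmax(axis=-1)
    return preds
```

Since the big model runs only on the low-confidence subset, the average cost per sample approaches the little model's cost when most samples are easy, which is the source of the reported MACs reductions.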