Neural network (NN) compression via techniques such as pruning and quantization requires setting compression hyperparameters (e.g., the number of channels to prune, or the bitwidths for quantization) for each layer, either manually or via neural architecture search (NAS), which can be computationally expensive. We address this problem with an end-to-end technique that optimizes for the model's Floating Point Operations (FLOPs) or for on-device latency via a novel $\frac{\ell_1}{\ell_2}$ latency surrogate. Our algorithm is versatile and can be used with many popular compression methods, including pruning, low-rank factorization, and quantization. Crucially, it is fast and runs in almost the same amount of time as training a single model, which is a significant training speed-up over standard NAS methods. For BERT compression on GLUE fine-tuning tasks, we achieve a $50\%$ reduction in FLOPs with only a $1\%$ drop in performance. For compressing MobileNetV3 on ImageNet-1K, we achieve a $15\%$ reduction in FLOPs and an $11\%$ reduction in on-device latency with no drop in accuracy, while still requiring $3\times$ less training compute than SOTA compression techniques. Finally, for transfer learning on smaller datasets, our technique identifies architectures that are $1.2\times$-$1.4\times$ cheaper than the standard MobileNetV3 and EfficientNet suites of architectures, at almost the same training cost and accuracy.
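The abstract does not spell out the form of the $\frac{\ell_1}{\ell_2}$ surrogate, so the following is only a minimal sketch of the general idea behind such ratios, not the authors' formulation. It relies on one standard fact: for any nonzero vector $v$, $\left(\frac{\|v\|_1}{\|v\|_2}\right)^2 \le \|v\|_0$, with equality when all nonzero entries have equal magnitude. The squared ratio can therefore act as a smooth stand-in for a count of active units, which is the kind of quantity that FLOPs and latency depend on. The function names and the example vectors below are hypothetical, chosen for illustration.

```python
def l1_over_l2_squared(values, eps=1e-12):
    """Return (||v||_1 / ||v||_2)^2 for a list of floats.

    Illustrative only: a smooth proxy for the number of nonzero entries.
    `eps` (an assumption of this sketch) guards against division by zero.
    """
    l1 = sum(abs(v) for v in values)
    l2_squared = sum(v * v for v in values)
    return (l1 * l1) / (l2_squared + eps)


def count_nonzero(values):
    """Exact, non-differentiable count that the ratio approximates."""
    return sum(1 for v in values if v != 0.0)


# Hypothetical per-channel scale vectors for a single layer.
dense = [0.5, -0.5, 0.5, -0.5]      # all four channels equally active
pruned = [0.5, 0.0, 0.0, -0.5]      # two channels switched off
skewed = [1.0, 0.01, 0.01, 0.01]    # one dominant channel, three near zero

for name, vec in [("dense", dense), ("pruned", pruned), ("skewed", skewed)]:
    print(name, count_nonzero(vec), round(l1_over_l2_squared(vec), 4))
```

For the equal-magnitude vectors the squared ratio matches the nonzero count, while for the skewed vector it is close to 1 even though no entry is exactly zero. That sensitivity to near-zero entries is what makes a ratio of this kind usable inside gradient-based training, where an exact count provides no gradient signal.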