USDC: Unified Static and Dynamic Compression for Visual Transformer

Visual Transformers have achieved great success in almost all vision tasks, such as classification and detection. However, their model complexity and inference speed hinder deployment in industrial products. Various model compression techniques directly compress a visual transformer into a smaller one while trying to maintain model performance, but performance drops dramatically when the compression ratio is large. Dynamic network techniques have also been applied to compress visual transformers dynamically, yielding input-adaptive efficient sub-structures at inference time and a better trade-off between compression ratio and model performance. However, the upper bound on the memory of dynamic models is not reduced in practical deployment, because the whole original visual transformer and the additional control gating modules must be loaded onto the device together for inference. To alleviate the disadvantages of both categories of methods, we propose to unify static and dynamic compression to obtain an input-adaptive compressed model that better balances the total compression ratio and model performance. Moreover, in practical deployment the batch sizes of the training and inference stages usually differ, which causes inference performance to be worse than training performance, a problem that previous dynamic network papers have not addressed. We propose a sub-group gates augmentation technique to solve this performance drop. Extensive experiments demonstrate the superiority of our method on various baseline visual transformers, such as DeiT and T2T-ViT.
