Vision Transformers (ViTs) have marked a paradigm shift in computer vision, outperforming previous state-of-the-art models across diverse tasks. However, their practical deployment is hampered by high computational and memory demands. This study addresses that challenge by evaluating four primary model compression techniques: quantization, low-rank approximation, knowledge distillation, and pruning. We systematically analyze and compare the efficacy of these techniques, and of their combinations, in optimizing ViTs for resource-constrained environments. Our experimental evaluation demonstrates that these methods achieve a balanced trade-off between model accuracy and computational efficiency, paving the way for wider deployment on edge computing devices.
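To make two of the four techniques concrete, the following is a minimal sketch of symmetric int8 quantization and magnitude-based pruning applied to a toy weight vector. It is an illustration under stated assumptions, not the procedure evaluated in this study: the function names (`quantize_int8`, `dequantize`, `magnitude_prune`), the per-tensor scaling scheme, the 50% sparsity level, and the example weights are all hypothetical choices made for this example.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization (assumed scheme): map floats to
    integer codes in [-127, 127] using a single shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]


def magnitude_prune(weights, sparsity):
    """Unstructured magnitude pruning: zero out the `sparsity` fraction of
    weights with the smallest absolute value."""
    k = int(len(weights) * sparsity)
    # Indices ordered by ascending magnitude; the first k are pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:k]:
        pruned[i] = 0.0
    return pruned


# Hypothetical toy weights standing in for one row of a ViT layer.
weights = [0.82, -0.05, 0.31, -1.27, 0.02, 0.64]

codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
sparse = magnitude_prune(weights, 0.5)

print("int8 codes:", codes)
print("max quantization error:",
      max(abs(w - r) for w, r in zip(weights, restored)))
print("pruned weights:", sparse)
```

Quantization shrinks storage per weight at the cost of a rounding error bounded by half the scale, while pruning removes weights outright. Combining the two, as the study does for several technique pairs, compounds the savings and also the potential accuracy loss.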