Deep neural networks consistently represent the state of the art in most computer vision problems. In these scenarios, larger and more complex models have demonstrated superior performance over smaller architectures, especially when trained on large amounts of representative data. With the recent adoption of Vision Transformer (ViT)-based architectures and advanced Convolutional Neural Networks (CNNs), the parameter count of leading backbone architectures has grown from 62M in 2012 with AlexNet to 7B in 2024 with AIM-7B. Consequently, deploying such deep architectures is challenging in environments with processing and runtime constraints, particularly in embedded systems. This paper surveys the main model compression techniques applied to computer vision tasks, enabling modern models to be used in embedded systems. We present the characteristics of each compression subarea, compare different approaches, and discuss how to choose the best technique and the variations to expect when evaluating it on different embedded devices. We also share code to assist researchers and new practitioners in overcoming the initial implementation challenges of each subarea, and present trends in model compression. Case studies of compressed models are available at \href{https://github.com/venturusbr/cv-model-compression}{https://github.com/venturusbr/cv-model-compression}.